1 Random Problems

A problem (a Boolean function $f : \{0,1\}^N \to \{0,1\}$) is characterized by its
randomness (à la Kolmogorov) $R(f)$ and its entropy (à la Shannon) $H(f)$.
Random problems have large values of $R(f)$, and are a good model for
many natural pattern recognition problems. $R(f)$ and $H(f)$ are shown to
be lower and upper bounds, respectively, for the size of a minimum-size
circuit that computes $f$. False entropy, namely the hidden structure of a
problem, is related to the difference between $H(f)$ and $R(f)$.

1.1  Introduction 

It is difficult to find good mathematical models for many natural problems
such as pattern recognition. Not only does this difficulty preclude finding
good solutions for these problems, but it also precludes estimating their
complexity using the standard tools of the theory of computational
complexity (Traub, 1985). Part of the difficulty can be traced to symptoms
such as ill-definition, fuzziness, and inexactness. However, the difficulty of
modeling these problems may be inherent in some cases. To illustrate what
we mean, consider the following problem:

A.  Input:  255-45-5237  Output:  Is  this  the  social  security  number  of  a 
convict? 

To  solve  this  problem  in  general,  one  needs  a  list  of  the  social  security 
numbers  of  all  convicts.  It  is  highly  improbable  that  there  exists  a  simple 
relation  between  the  social  security  number  and  the  legal  status  of  a  person. 
Barring  such  a  relation,  one  cannot  hope  for  an  algorithm  to  ‘manipulate’ 
the  input  so  as  to  arrive  at  the  output.  In  other  words,  the  above  list 
cannot  be  compacted  into  a  simple  algorithm.  Contrast  this  problem  with 
the  following  familiar  problem: 

B. Input: 255,455,237 Output: Is this number a prime?

Although one can resort to consulting a list of primes, there is the option
of writing a simple algorithm to test for primality and applying it to this
number. It may take a long time to execute, but the algorithm itself is
short in comparison with the list of primes. What is the basic difference
between problems A and B? The notion of a prime has a short effective
definition, while the notion of a convict does not. The long list of social
security numbers of convicts is the effective definition of the notion of a
convict.
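
The contrast between problems A and B can be sketched in a few lines. This is only an illustration: the function names are hypothetical, trial division stands in for "a simple algorithm to test for primality", and the list entries are made-up placeholders rather than real records.

```python
def is_prime(n: int) -> bool:
    """Trial division: a short rule that replaces an ever-growing list of primes."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


# Problem A has no comparable rule: membership can only be decided by
# consulting an explicit list. The entries here are invented placeholders.
CONVICT_LIST = {"000-00-0001", "000-00-0002"}


def is_convict(ssn: str) -> bool:
    """Pure table lookup; the program is essentially as long as the list."""
    return ssn in CONVICT_LIST
```

The primality test stays the same size however large the input grows, whereas the lookup program grows with every entry added to the list.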

Problems  which  do  not  have  a  concise  effective  definition  are  called 
random  problems  (Abu-Mostafa,  1985).  Randomness  here  is  based  on  the 
length  of  the  shortest  algorithm  (Kolmogorov,  1965),  and  has  nothing  to  do 
with  probability  or  fuzziness.  If  this  length  exceeds  a  certain  threshold,  the 
problem  is  considered  random.  On  the  other  hand,  for  structured  problems 
such  as  B,  the  algorithm  can  be  quite  short.  The  difficulty  in  modeling 
random  problems  is  inherent.  This  is  because  an  effective  model  could  be 
viewed  as  an  algorithm  (not  necessarily  a  very  efficient  one),  and  hence  has 
to  be  long  in  the  case  of  a  random  problem. 

Many natural pattern recognition problems can be considered random
problems. This fact is usually overlooked due to the apparent ‘structure’
some of these problems have, e.g., visual images which have many clear
regularities. However, after these regularities are considered, a major random
component is left. A complete model for visual scenes, or an algorithm for
computer vision, will cover the random as well as the structured
components of the problem, and hence will have to be sufficiently long.

Three factors contribute to the significance of random problems and
widen their scope. First of all, the definition of algorithmic randomness is
based on universal Turing machines (Turing, 1936), which are more powerful
than any physical system. This makes some problems which are not truly
random look random for all practical purposes, and they will have to be
treated as such. The second factor is that there is no constructive way in
general to find the shortest algorithm for an arbitrary problem (Kolmogorov,
1965). In spite of its generality, this fact reflects the difficulty of modeling
and suggests that some problems will have to be treated as random problems
just because no one will be able to find their concise model. Finally,
although it may be logically impossible to tell whether a problem is random
(Chaitin, 1982), a problem generated by probabilistic means is very likely
to be random. Therefore, the assumption that a natural problem is random
is probably valid.

This last remark leads to another observation. Most of the practical
random  problems  turn  out  to  be  ill-defined  as  well.  However,  we  maintain 
separation  of  concerns  in  this  paper.  Only  randomness,  as  defined  above, 
is  assumed  in  the  problems  we  address  here.  Our  aim  is  to  characterize 
random  problems  and  their  computational  demands. 

In Section 1.2, we introduce the formal definitions of randomness and
entropy and prove some basic facts about them. In Section 1.3, we relate
these quantities to the complexity of implementing the function. Finally,
we draw some insight into the class of random problems in Section 1.4. We
shall restrict ourselves to binary alphabets for simplicity; the
generalization to arbitrary finite alphabets is straightforward. All logarithms
and exponentials are to the base 2.

1.2  Randomness  and  entropy 

Let $N$ be an arbitrary fixed positive integer and consider the Boolean
functions of the form $f : \{0,1\}^N \to \{0,1\}$. Any such function is fully
characterized by its truth table, which can be listed as a ($2^N$-bit long)
binary string $\tau(f) = \tau_0 \tau_1 \cdots \tau_k \cdots \tau_{2^N-1}$, where
$\tau_k$ is the value of $f$ when the argument is the $N$-bit binary
representation of the number $k$.
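
As a small illustration of this definition, the following sketch lists the truth table of a function as a bit string. The helper names and the most-significant-bit-first convention for the binary representation of k are assumptions made for the example, not part of the text.

```python
from typing import Callable, Sequence


def truth_table(f: Callable[[Sequence[int]], int], n: int) -> str:
    """Return the truth table of f as a string of 2**n bits.

    Bit k is the value of f on the n-bit binary representation of k
    (most significant bit first, an assumed convention).
    """
    bits = []
    for k in range(2 ** n):
        x = [(k >> (n - 1 - i)) & 1 for i in range(n)]
        bits.append(str(f(x)))
    return "".join(bits)


def parity(x: Sequence[int]) -> int:
    return sum(x) % 2


def majority(x: Sequence[int]) -> int:
    return int(2 * sum(x) > len(x))
```

For N = 3, `truth_table(parity, 3)` lists the 8 values of the parity function in order of increasing k.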

Let $U$ denote a fixed universal Turing machine with binary input
alphabet $\{0,1\}$. $U$ takes a binary string $p$ as input (program) and runs
on $p$ to produce an output string $s$ (if it eventually halts). In this case,
we say that $U(p) = s$. Based on $U$, the Kolmogorov complexity of a string
$s$ is defined by

$$K(s) = \min\{\, |p| \;:\; U(p) = s \,\},$$

where $|p|$ denotes the length of the string $p$.

The Kolmogorov complexity measures the degree of randomness of a
string; if two binary strings of the same length have different Kolmogorov
complexities, the one with the higher complexity is more ‘random’. The
randomness of a Boolean function $f$ is based on the Kolmogorov complexity
of its truth table:

$$R(f) = \log K(\tau(f)) \text{ bits},$$

where the logarithm (to the base 2) is taken to make the range of $R(f)$ run
from $\approx 0$ (where $p$ is a very short program, hence $\tau(f)$ has a very
regular pattern) to $\approx N$ (where $p$ is as long as $\tau(f)$, hence
$\tau(f)$ has no pattern whatsoever).
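
Kolmogorov complexity is not computable, so the quantity defined above cannot be evaluated directly. As a rough illustration only, the sketch below substitutes the length of a zlib-compressed truth table for K; a compressed description is just one particular program-like encoding, so this is a crude stand-in that can only overestimate. The function name and the test strings are assumptions for the example.

```python
import math
import random
import zlib


def randomness_proxy(table_bits: str) -> float:
    """log2 of the zlib-compressed length (in bits) of a '0'/'1' truth table.

    This is a crude stand-in for log K(truth table), not the true value.
    """
    packed = int(table_bits, 2).to_bytes((len(table_bits) + 7) // 8, "big")
    compressed_bits = 8 * len(zlib.compress(packed, 9))
    return math.log2(compressed_bits)


N = 12
# A highly regular table: f alternates 0, 1, 0, 1, ...
regular = "01" * (2 ** N // 2)
# A table drawn by fair coin flips (fixed seed for repeatability).
rng = random.Random(0)
noisy = "".join(rng.choice("01") for _ in range(2 ** N))
```

On these two tables the proxy is small for the regular pattern and close to N for the coin-flip table, matching the two ends of the range described above.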

A problem will be considered random if the corresponding function $f$
has a large value of $R(f)$. How large? We fix a threshold $R_0$ and make the
definition relative to $R_0$. The choice of a particular $R_0$ is not critical
to most of the theory, and may therefore be motivated by practical
considerations.

Definition. A problem $f : \{0,1\}^N \to \{0,1\}$ is said to be random if
$R(f) > R_0$.

Since $R(f)$ can be at most $\approx N$, the threshold $R_0$ should be smaller
than $N$. This is necessary to make the definition interesting, i.e., to
guarantee that some of the problems are indeed random. In fact, the
overwhelming majority of all problems will be random even if $R_0$ is only a
few bits smaller than $N$. To see this, we observe that there are $2^{2^N}$
problems (Boolean functions of $N$ variables), while there are at most
$2^0 + 2^1 + \cdots + 2^K < 2^{K+1}$ programs $p$ of length at most $K$.
Taking $K = 2^{R_0}$, at most $2^{2^{R_0}+1}$ problems can be non-random,
and this is only a negligible fraction of $2^{2^N}$.
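
The counting argument can be checked numerically. The sketch below works with base-2 logarithms to avoid astronomically large numbers; the function name is hypothetical and an integer threshold is assumed for simplicity.

```python
def log2_nonrandom_fraction_bound(n: int, r0: int) -> int:
    """log2 of the bound 2**(2**r0 + 1) / 2**(2**n) on the fraction of
    non-random problems, for an integer threshold r0 (an assumption)."""
    return (2 ** r0 + 1) - 2 ** n
```

For example, with N = 20 and a threshold only 3 bits below N, `log2_nonrandom_fraction_bound(20, 17)` is -917503, so the non-random problems make up at most a fraction 2^(-917503) of all problems.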

On the other hand, if $R_0$ is not very small, it will be impossible to
pinpoint a specific problem and prove that it is random. This is a consequence
of Chaitin’s version of Gödel’s incompleteness theorem: there is a number
$K_0$ (which depends on the axiomatic system) such that no statement of the
form ‘$K(s) > K_0$’ is provable (within the system). If we pick
$R_0 > \log K_0$, where $K_0$ corresponds to axiomatic set theory, it will be
impossible to prove (using regular mathematics) that any given problem is
random.

The  difficulty  of  proving  randomness  for  specific  problems  is  by  no 
means  a  serious  drawback  for  the  notion  of  random  problems.  There  are 
many  cases  where  the  probability  that  the  problem  is  random  is  sufficiently 
high  to  warrant  treating  it  as  a  random  problem,  in  spite  of  the  lack  of  a 
proof of randomness. In fact, whenever probability is involved in generating
$f$, the chances are $f$ will be a random problem.

Example. Let $f : \{0,1\}^N \to \{0,1\}$, where $N$ is large, be generated as
follows. Each bit of the truth table $\tau(f) = \tau_0 \cdots \tau_{2^N-1}$ is
(independently) set to 1 with probability $\epsilon$ and to 0 with probability
$1 - \epsilon$, where $0 < \epsilon < 1$. The expected number of 1’s in
$\tau(f)$ is therefore $\epsilon 2^N$. With high probability (law of large
numbers), $f$ will have about that many 1’s in its truth table. Therefore,
$f$ will be any of $\approx \binom{2^N}{\epsilon 2^N}$ functions with
approximately equal probability. This number can be estimated as
$\approx 2^{H(\epsilon) 2^N}$, where
$H(\epsilon) = -\epsilon \log \epsilon - (1 - \epsilon) \log(1 - \epsilon)$.
We can again enumerate the programs $p$ of length 0, 1, etc., and conclude
that $R(f) > N - \Delta$ with high probability, where $\Delta$ is only a few
bits more than $-\log H(\epsilon)$.
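
The estimate used in this example can be checked numerically for a modest N. The sketch below compares the exact logarithm of the binomial count with the entropy estimate; the variable names and the particular values N = 10 and probability 0.25 are assumptions chosen for illustration.

```python
import math


def binary_entropy(eps: float) -> float:
    """H(eps) = -eps*log2(eps) - (1-eps)*log2(1-eps), for 0 < eps < 1."""
    return -eps * math.log2(eps) - (1 - eps) * math.log2(1 - eps)


def log2_count_exact(table_len: int, ones: int) -> float:
    """log2 of the number of truth tables of this length with this many 1's."""
    return math.log2(math.comb(table_len, ones))


N = 10
eps = 0.25
table_len = 2 ** N
exact = log2_count_exact(table_len, int(eps * table_len))
estimate = binary_entropy(eps) * table_len
```

Here `exact` falls slightly below `estimate`, and `math.log2(exact)` lies close to `N + math.log2(binary_entropy(eps))`, which is the form of the bound on the randomness quoted in the example.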

This example also illustrates that the randomness of a problem is
affected by the number of 1’s in the truth table. If the number of 1’s is
very small, one can write a short program $p$ to generate $\tau(f)$ by
specifying where the 1’s are in $\tau(f)$. The same can be done in the dual
case where the number of 0’s is small. Problems which have few 1’s (or few
0’s) in their truth tables are of special interest because they model the
cases where only a small fraction of inputs to $f$ are of interest, a
condition commonly encountered in natural problems. The quantity that
captures this property is entropy.

Let $h(f) = \min\{\, |f^{-1}(1)|,\ |f^{-1}(0)| \,\}$, i.e., the number of 1’s
or the number of 0’s (whichever is smaller) in the truth table of $f$. The
(deterministic) entropy of $f$ is defined by

$$H(f) = \log(1 + h(f)) \text{ bits}.$$

The logarithm (and the added 1) make the range of $H(f)$ run from 0 (the
two constant functions) to $\approx N$ (functions with as many 1’s as 0’s).
Hence, $H(f)$ has essentially the same range as $R(f)$.
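
The deterministic entropy is straightforward to compute from a truth table. A minimal sketch, assuming the table is given as a string of '0' and '1' characters (the function name is hypothetical):

```python
import math


def deterministic_entropy(table_bits: str) -> float:
    """H(f) = log2(1 + h(f)), where h(f) is the smaller of the number of
    1's and the number of 0's in the truth table."""
    ones = table_bits.count("1")
    zeros = len(table_bits) - ones
    return math.log2(1 + min(ones, zeros))
```

A constant function gives 0, while a balanced function of N = 3 variables (four 1's and four 0's) gives log2(5), a little above 2.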

Except for a small ‘error’ term (at most the order of $\log N$), $H(f)$
serves as an upper bound for $R(f)$. To see this, we assume that $\tau(f)$ has
only a limited number of 1’s (the dual case of 0’s is similar) and write a
relatively short program $p$ that generates $\tau(f)$. The program is a listing
of the locations of the 1’s in $\tau(f)$. Since each location in $\tau(f)$ can
be specified by $N$ bits, the length of $p$ is approximately $N h(f)$. The
logarithm of that, which is an upper bound for $R(f)$, is approximately
$H(f) + \log N$. An enumeration of all programs of length 0, 1, etc., and of
the number of functions of a certain level of entropy shows that $H(f)$ is a
tight upper bound for $R(f)$. In fact, for most functions,
$R(f) \approx H(f)$.

It is interesting to notice that $R(f)$ and $H(f)$ can be considered two
extremes in a spectrum of quantities that measure the complexity of
specifying $f$ based on models of varying degree of sophistication. On the one
hand, $H(f)$ measures the complexity of specifying $f$ if the specification is
done by simply listing the 1’s or the 0’s of the function. Hence, the model
on which the measure $H(f)$ is based is the ‘lookup’ model. On the other
hand, the model on which $R(f)$ is based is the universal Turing machine.
$R(f)$ measures the complexity of specifying $f$ based on a very powerful tool
that generates $\tau(f)$ from the specification. Hence $R(f)$ is as small as
can be; it can take advantage of any effective property $f$ may have. There
are many models that fall between these two extremes. For example, a
time-bounded version of $R(f)$ can be based on the time-bounded Kolmogorov
complexity of $\tau(f)$. The underlying model will be a universal Turing
machine that is allowed to run for only a limited number of steps. Another
model, which is treated in more detail in Section 1.3, is combinational
circuits, where regularities of $f$ that can be captured by logic elements help
reduce the size of the specification of $f$.

The difference between $H(f)$ and $R(f)$ is a significant quantity. It is
called false entropy (Abu-Mostafa, 1986) and expresses how much ‘hidden’
structure the problem has. This interpretation follows from the above
discussion; the model on which $H(f)$ is based is the lookup model that does
not take any advantage of the location of the 1’s and 0’s of $f$, just their
number, while the model on which $R(f)$ is based takes advantage of any
effective regularity, no matter how subtle, to reduce the size of the
specification. The difference expresses how much of the entropy of $f$ can be
removed if the structure of $f$ is taken into consideration. The problem
of pattern recognition hinges on removing as much of the false entropy as
possible.

1.3  Complexity  bounds 

Consider the problem of implementing the function $f$, e.g., using a
combinational circuit (Savage, 1976). We wish to estimate the complexity of
such implementation, and investigate the relation between complexity, entropy,
and randomness. Since $N$ is fixed, the number of input instances is finite
and our measure of complexity will be a nonuniform one, i.e., not based on
a finite process that works for any of an infinite number of input instances.

Nonuniformity of the complexity measure in our context is more realistic
for two reasons. First, the pattern recognition problems we are modeling
are finite in nature with no clear extension into an infinite problem.
Secondly, the systems that are projected for pattern recognition are based on
learning, i.e., automatic development of the system from training samples.
As such, the final system does not have to be uniform although the learning
mechanism itself may be uniform.

There are several ways of defining circuits, all of which are equivalent,
and a corresponding number of ways of defining circuit complexity (size), all
of which are within a constant (independent of $N$ and $f$) from one another.
For concreteness, we give one such definition in detail. Our circuits are
combinational (loop-free), with unlimited fan-out (an output of a gate can
be used as an input to any number of gates). The only type of gate we use
is the two-input NAND gate (whose output is 0 if, and only if, both inputs
are 1’s), which by itself is a complete basis (any Boolean function can be
simulated using a circuit consisting exclusively of copies of this gate). The
independent Boolean variables (inputs of $f$) are called $x_1, \ldots, x_N$,
and are available to be used as inputs to any gate in the circuit.

A circuit is a chain $\Gamma = \gamma_1 \cdots \gamma_Q$ of $Q$ gates. The
outputs of these gates are called $y_1, \ldots, y_Q$, respectively. Each gate
$\gamma_q$ can have as inputs any of the independent variables
$x_1, \ldots, x_N$ as well as the outputs of the previous gates
$y_1, \ldots, y_{q-1}$. Since all the gates are two-input NAND gates, we only
need to specify the inputs to each $\gamma_q$. Therefore, formally, each
$\gamma_q$ is a pair (not necessarily distinct, order doesn’t matter) of
elements from the set $\{x_1, \ldots, x_N, y_1, \ldots, y_{q-1}\}$. Finally,
the output of the circuit is the output of the last gate, $y_Q$. We say that a
circuit $\Gamma$ simulates a function $f$ if $f = y_Q$ for all assignments of
the variables $x_1, \ldots, x_N$. The number of gates in $\Gamma$, namely $Q$,
is denoted by $c(\Gamma)$. The (circuit) complexity of a problem $f$ is
defined by

$$C(f) = \log \min\{\, c(\Gamma) \;:\; \Gamma \text{ simulates } f \,\} \text{ bits}.$$
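
The formal definition of a circuit as a chain of pairs translates directly into code. The sketch below is one possible encoding, assuming 0-based wire indices in which indices 0 to N-1 are the inputs and later indices are gate outputs; the names `evaluate`, `simulates` and `XOR_CIRCUIT` are hypothetical.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

# A gate is a pair of wire indices: 0..N-1 are the inputs x_1..x_N,
# N..N+q-2 are the outputs of the previous gates y_1..y_{q-1}.
Gate = Tuple[int, int]


def evaluate(circuit: Sequence[Gate], x: Sequence[int]) -> int:
    """Evaluate a chain of two-input NAND gates; return the last gate's output."""
    wires = list(x)
    for a, b in circuit:
        if a >= len(wires) or b >= len(wires):
            raise ValueError("a gate may only read inputs or earlier gates")
        wires.append(1 - (wires[a] & wires[b]))  # NAND
    return wires[-1]


def simulates(circuit: Sequence[Gate],
              f: Callable[[Sequence[int]], int], n: int) -> bool:
    """True if the circuit output equals f on all 2**n input assignments."""
    return all(evaluate(circuit, x) == f(x) for x in product((0, 1), repeat=n))


# A four-gate NAND circuit for the exclusive-or of two inputs.
XOR_CIRCUIT = [(0, 1), (0, 2), (1, 2), (3, 4)]
```

Since `XOR_CIRCUIT` has four gates and simulates exclusive-or, the complexity of that function under this definition is at most log 4 = 2 bits.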

Again we have the range of $C(f)$ running from $\approx 0$ (simple functions)
to $\approx N$ (complex functions). The fact that the maximum value $C(f)$ can
assume is $\approx N$ follows from the exhaustive implementation of any Boolean
function of $N$ variables that uses $\approx 2^N$ gates (which can be improved
to $2^N/N$ due to a classical result of Shannon). We now show that, except for
an error term of at most the order of $\log N$, the value $C(f)$ is at least
$R(f)$ and at most $H(f)$, which we write as

$$R(f) \prec C(f) \prec H(f).$$

This relationship reflects the fact that combinational logic is more
sophisticated than table lookup, but less sophisticated than universal Turing
machines.

To see that $C(f) \prec H(f)$, consider the case where $f$ has $h(f)$ 1’s (the
dual case is similar). One can build a circuit with $N$ one-input NOT gates,
$h(f)$ $N$-input AND gates, and one $h(f)$-input OR gate to simulate $f$
(implementing the 1’s of the function directly by a sum of products). One can
replace all these gates by at most $\alpha N (1 + h(f))$ two-input NAND gates
(where $\alpha$ is a suitable constant). Hence
$C(f) \le H(f) + \log N + \text{constant}$.
Therefore, apart from the error term, the bound is valid.
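
The sum-of-products construction in this argument can be sketched as follows. The code builds the function from a list of its 1-locations and reports the number of multi-input gates used before conversion to NAND gates; the names are hypothetical, and the conversion to two-input NAND gates is not carried out here.

```python
from typing import Callable, List, Sequence, Tuple


def sum_of_products(
    minterms: List[Tuple[int, ...]], n: int
) -> Tuple[Callable[[Sequence[int]], int], int]:
    """Return (f, gate_count) for the direct sum-of-products implementation.

    minterms lists the input assignments on which f equals 1.
    gate_count is N NOT gates + h AND gates + 1 OR gate.
    """

    def f(x: Sequence[int]) -> int:
        complemented = [1 - b for b in x]  # the N one-input NOT gates
        products = []
        for m in minterms:  # one N-input AND gate per 1 of the function
            literals = [x[i] if m[i] else complemented[i] for i in range(n)]
            products.append(int(all(literals)))
        return int(any(products))  # the single OR gate

    return f, n + len(minterms) + 1
```

For exclusive-or of two inputs, with 1-locations (0, 1) and (1, 0), the construction uses 2 + 2 + 1 = 5 multi-input gates.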

To see that $C(f) \succ R(f)$, consider a program $p$ that encodes a
minimum-size circuit $\Gamma = \gamma_1 \cdots \gamma_Q$ that simulates $f$.
Each $\gamma_q$ is a pair of variables selected from at most $N + Q$ variables.
Hence it takes at most $2\log(N + Q)$ bits to encode each $\gamma_q$. Hence,
the program that encodes the whole circuit $\Gamma$ will be at most
$\alpha Q \log(N + Q)$ bits long (where $\alpha$ is a suitable constant). The
logarithm of that is an upper bound for $R(f)$. By definition,
$\log Q = C(f)$. The other (error) terms are at most the order of $\log N$
since $Q$ is at most the order of $2^N$.
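
The encoding step of this argument can be made concrete. The sketch below writes each gate as two fixed-width wire indices, assuming the same 0-based pair representation as a list of index pairs; the function name and the fixed-width choice are assumptions for illustration, and the constant-size decoding program is left out.

```python
import math
from typing import Sequence, Tuple


def encode_circuit(circuit: Sequence[Tuple[int, int]], n: int) -> str:
    """Encode a NAND circuit as a bit string, two fixed-width indices per gate.

    Each index selects one of at most n + Q wires, so ceil(log2(n + Q))
    bits per index suffice.
    """
    q = len(circuit)
    width = max(1, math.ceil(math.log2(n + q)))
    return "".join(
        format(a, f"0{width}b") + format(b, f"0{width}b") for a, b in circuit
    )
```

For a four-gate circuit on two inputs, each index takes 3 bits, so the description is 4 x 2 x 3 = 24 bits long, in line with the bound of order Q log(N + Q).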

The  fact  that  R(f)  is  a  lower  bound  for  C(f)  implies  that  random 
problems  cannot  be  solved  by  small  circuits. 

1.4  Conclusion 

The notion of random problems was introduced to capture the inherent
difficulty of natural pattern recognition problems and estimate their
computational demands. The paper introduced the main concepts and proved
some basic facts for the idealized case where the problem is defined as a
deterministic Boolean function. The results are summarized in the
relationship

$$R(f) \prec C(f) \prec H(f).$$


The theory of rate-distortion may be employed to accommodate the
practical case of continuous-valued functions. To accommodate the case
where there is a probability distribution over the input variables, one can
define a probabilistic version of the measures $R(f)$, $H(f)$, $C(f)$, e.g.,

$$R_\delta(f) = \min\{\, R(g) \;:\; \Pr(f \ne g) \le \delta \,\}.$$

In words, the randomness of $f$ with tolerance for error $\delta$ of the time
is the minimum randomness of any function $g$ that differs from $f$ at most
$\delta$ of the time.

We also made the remark that the difference between $H(f)$ and $R(f)$
highlights the hidden structure of the problem which a pattern recognition
system has to be able to detect, at least partially.

2 Higher Order Associative Memories and
their Optical Implementations

The properties of higher order memories are described. The non-redundant,
up to $N$-th order polynomial expansion of $N$-dimensional binary vectors is
shown to yield orthogonal feature vectors. The properties of expansions
that contain only a single order are investigated in detail and the use of
the sum of outer product algorithm for training higher order memories is
analyzed. Optical implementations of quadratic associative memories are
described using volume holograms for the general case and planar
holograms for shift invariant memories.

2.1  Introduction 

An associative memory can be thought of as a system that stores a
prescribed set of vector pairs $(x^m, y^m)$ for $m = 1, \ldots, M$ and
produces $y^m$ as its output when $x^m$ becomes its input. We denote by $N$
and $N_0$ the dimensionality of the input and output vectors respectively.
When the output vectors are stored as binary $N_0$-tuples, the associative
memory can be implemented as an array of discriminant functions, each
dichotomizing the input vectors into two classes. This type of associative
memory is shown schematically in Fig. 1. In evaluating the effectiveness of a
particular associative memory we are concerned with its ability to store a
large number of associations (capacity), the ease with which the parameters of
the memory can be set to realize the prescribed mappings (learning), and
how it responds to inputs that are not members of its training set
(generalization). In this paper we discuss a class of associative memories
known as higher order memories that have been recently investigated by a
number of separate research groups [1,2,3,4,5,6,7,8]. Our motivation for
investigating these memories was the increase in storage capacity that results
from the increase in the number of independent parameters or degrees of
freedom that is needed to describe a higher order associative mapping. The
relationship between the degrees of freedom of a memory and its ability to
store associations [9] is fundamental to this work and we state it in the
following subsection as a theorem.


2.1.1  Degrees  of  freedom  and  storage  capacity 


Let  D  be  the  number  of  independent  variables  (degrees  of  freedom)  we  have 
under  our  control  to  specify  input-output  mappings  and  let  each  parameter 
have  K  separate  levels  or  values  that  it  can  assume.  We  define  the  storage 
capacity  C  to  be  the  maximum  number  of  arbitrary  associations  that  can 
be  stored  and  recalled  without  error. 


Theorem 1

$$C \le \frac{D \log_2 K}{N_0} \qquad (1)$$


Proof: The number of different states of the memory is given by $K^D$ and
the total number of outputs that a given set of $M$ input patterns can be
mapped to is $2^{N_0 M}$. If the number of mappings were larger than the
number of distinct memories then mappings would exist that are not
implementable. Requiring that all mappings can be done, i.e.
$K^D \ge 2^{N_0 M}$, leads to the relationship of the theorem.
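
The counting argument of the proof can be written out as a short check. The function names below are hypothetical; the code simply evaluates the bound of Eq. 1 and the state-counting condition from which it follows.

```python
import math


def capacity_bound(d: int, k: int, n0: int) -> float:
    """Upper bound of Eq. 1: D * log2(K) / N0 associations."""
    return d * math.log2(k) / n0


def can_realize_all_mappings(d: int, k: int, n0: int, m: int) -> bool:
    """Necessary condition from the proof: the memory has at least as many
    states (K**D) as there are output assignments for M inputs (2**(N0*M))."""
    return k ** d >= 2 ** (n0 * m)
```

With D = 10 binary parameters (K = 2) and N0 = 5 output bits, the bound is 2 associations: the counting condition holds for M = 2 and fails for M = 3.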

The equality in Eq. 1 is achieved by Boolean circuits such as programmable
logic arrays and an extreme case of a higher order memory we will discuss
later. When the equality holds, resetting any one bit in any one of the
parameters of the memory gives a different mapping. Such a memory
cannot learn from the training set to respond in some desirable way to inputs
that it has never seen before. The only way to get generalization when
$C = D \log_2 K / N_0$ is to impose on it the overall structure of the memory
before learning begins. One of the appealing features of neural architectures
is the considerable redundancy in the degrees of freedom that is typically
available. Therefore, there is hope that while a memory learns specific
input-output correspondences it can also discover the underlying structure
that may exist in the problem and learn to respond correctly for a set of
inputs much larger than the training set. Moreover, the same redundancy
is responsible for the error tolerance that is evident in many neural
architectures. Higher order memories are generally redundant and they can
provide us with a methodology for selecting the degree of redundancy along
with the number of degrees of freedom and the associated capacity to store
random problems.


It  is  important  to  keep  in  mind  that  Eq.  1  holds  for  arbitrary  mappings. 
If  the  input  and  output  vectors  are  restricted  in  some  way  that  happens  to 
be  matched  to  the  architecture  of  a  particular  associative  memory  then  it 
may  be  possible  to  overcome  this  limit.  However,  selecting  the  architecture 
of  the  associative  memory  such  that  it  optimally  implements  only  a  subset  of 
all  possible  associations  is  basically  equivalent  to  choosing  the  architecture 
so that it generalizes in a desirable way. For instance, suppose that we
design an associative memory so that it is shift invariant (i.e., the output
is insensitive to a change in the position of the input) [4,10]. Then this
system  will  respond  predictably  to  all  the  shifted  versions  of  the  patterns 
that  were  used  to  train  it.  We  can  equivalently  think  of  this  system  as 
having  a  larger  storage  capacity  than  the  limit  of  Eq.  1  over  the  set  of  shift 
invariant  mappings.  If  we  can  identify  a  priori  the  types  of  generalization 
we  wish  the  memory  to  exhibit  and  we  can  find  ways  to  impose  these  on 
the  architecture  then  this  is  certainly  a  sensible  thing  to  do.  Higher  order 
memories  can  also  provide  a  convenient  framework  within  which  this  can 
be  accomplished. 

The penalty we must pay for the increase in the storage capacity that is
afforded by the increase in the degrees of freedom in a higher order
associative memory is an increase in implementation complexity. The computer
that implements a higher order memory must have sufficient storage capacity
to store a very large number of parameters. Moreover, it must be capable
of addressing the stored information with a high degree of parallelism in
order to produce an output quickly. We will discuss in this paper optical
implementations of second order memories and we will show a remarkable
compatibility between the computational requirements of these memories
and the ability of optics to store information in three dimensions.

2.1.2  Linear  discriminant  functions  and  associative  memories 

We will consider as a precursor the most familiar associative memories
that are constructed as arrays of linear discriminant functions [14]. A linear
discriminant function is a mapping from the sample space $X$, a subset of
$R^N$, to 1 or −1:

$$y = \mathrm{sgn}\{w^t \cdot x + w_0\}
   = \mathrm{sgn}\{w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_N x_N\} \qquad (2)$$

where sgn is the signum function, $w$ is a weighting vector and $w_0$ is a
threshold value. In this case the capacity is upper bounded by
$(N + 1) \log_2 K$ according to our definition of capacity. In this relatively
simple case the exact capacity is known to be equal to $C = N + 1$ assuming
the input points are in general position and $K = \infty$ [11]. An associative
memory is constructed by simply forming an array of linear discriminant
functions, each mapping the same input to a different binary variable. Several
algorithms exist for training such memories including the perceptron,
Widrow-Hoff, sum of outer products, pseudoinverse, and simplex methods
[13,14,15,16]. This memory can be thought of as the first order of the broader
class of higher order memories that contain not only a linear expansion of the
input vector but also quadratic and higher order terms. We will see in
section 3 that the learning methods that are applicable to the linear memories
generalize directly to the higher order memories. First, however, we will
describe the properties of the mappings that are implementable with higher
order memories in section 2. Finally, in section 4 we will describe optical
implementations of quadratic associative memories [2,12].
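
As a concrete illustration of an array of linear discriminant functions, the sketch below trains a small memory with the sum of outer products rule named above. It is a minimal sketch under stated assumptions: bipolar (+1/−1) vectors, zero thresholds, sgn(0) taken as +1, and hypothetical function names; it is not meant to reproduce the analysis of the later sections.

```python
from typing import List, Sequence


def train_outer_product(
    xs: Sequence[Sequence[int]], ys: Sequence[Sequence[int]]
) -> List[List[int]]:
    """Weight matrix W = sum over m of y^m (x^m)^t, one row per output bit."""
    n, n0 = len(xs[0]), len(ys[0])
    w = [[0] * n for _ in range(n0)]
    for x, y in zip(xs, ys):
        for l in range(n0):
            for i in range(n):
                w[l][i] += y[l] * x[i]
    return w


def recall(w: Sequence[Sequence[int]], x: Sequence[int]) -> List[int]:
    """Apply each linear discriminant: y_l = sgn(w_l . x), with sgn(0) = +1."""
    return [1 if sum(wi * xi for wi, xi in zip(row, x)) >= 0 else -1 for row in w]


# Two orthogonal bipolar input patterns and their associated outputs.
X = [[1, 1, 1, 1], [1, -1, 1, -1]]
Y = [[1, -1], [-1, -1]]
W = train_outer_product(X, Y)
```

Because the two stored inputs are orthogonal, the cross terms vanish and `recall(W, X[0])` and `recall(W, X[1])` return the stored outputs exactly.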

2.2  Properties  of  higher  order  memories 

A $\Phi$-function is defined to be a fixed mapping of the input vector $x$ to
an $L$-dimensional vector $z$ followed by a linear discriminant function:

$$y = \mathrm{sgn}\{w'^t \cdot z(x) + w_0\}
   = \mathrm{sgn}\{w'_1 z_1 + w'_2 z_2 + \cdots + w'_L z_L + w_0\} \qquad (3)$$

where $z(x) = (z_1(x), z_2(x), \ldots, z_L(x))$, $w'$ is an $L$ dimensional
weighting vector and $z(x)$ is an $L$ dimensional vector derived from $x$. The
storage capacity in this case is equal to the capacity of the second layer,
$L + 1$ [11], if the samples $z$ are in general position, whereas the upper
bound on the capacity from Eq. 1 is $(L + 1) \log_2 K$. The inefficiency in
this case is $\log_2 K$ bits, the same as for the linear discriminant function
even though the capacity can be raised arbitrarily by increasing $L$. It is
not known what the exact relationship between $L$ and $K$ is. That is, we do
not know whether for higher dimensions we need better resolution for the
values of the weights to be capable of implementing a fixed fraction of the
linear mappings. Recently Mok and Psaltis [17] have found the asymptotic
(large $N$) statistical capacity to be $C = N$ for a linear discriminant
function with binary weights. This result implies that even for large $N$, for
the vast majority of linear dichotomies, a large number of levels is not
required. Therefore a $\Phi$-function is an effective and straightforward
method for increasing the capacity of an associative memory without loss in
efficiency.

A higher order associative memory is an array of $\Phi$-functions with the
mappings $z(x)$ being polynomial expansions of the vector $x$. The schematic
diagram of a higher order associative memory is shown in Fig. 2. When
the polynomial expansion is of the $r$-th order in $x$, the output vector
$y$ is given by

$$y_l = \mathrm{sgn}\{W_l^r(x, x, \ldots, x) + W_l^{r-1}(x, \ldots, x) + \cdots + W_l^2(x, x) + W_l^1 x + w_{l0}\} \qquad (4)$$

where $l = 1, \ldots, N_0$, $W_l^k$ is a $k$-linear symmetric mapping and
$W_l^1$ is equivalent to $w^l$ in Eq. 2. According to Eq. 3,

$$z_j(x) = x_{i_1}^{n_1} x_{i_2}^{n_2} \cdots x_{i_r}^{n_r}, \qquad (5)$$

where $j = 1, 2, \ldots, L$, $i_k \in \{1, 2, \ldots, N\}$ for $k = 1, \ldots, r$
such that all $z_j$ are different, and $n_1, n_2, \ldots, n_r = 0, 1$. Then $L$
is $\binom{N+r}{r}$ [11], and hence the capacity bound is $(L + 1)\log_2 K$ as
before. For example, if $r = 2$, the function becomes quadratic and has the
form $y_l = \mathrm{sgn}\{W_l^2(x, x) + W_l^1 x + w_{l0}\}$, and the number of
nonredundant terms in the quadratic expansion is
$L = (N + 1)(N + 2)/2$.
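As an illustration of the definition above, the following is a minimal sketch of a polynomial $\Phi$-function in plain Python. The function names `poly_expand` and `phi_function` are hypothetical, not taken from the text, and the sketch assumes that the count $L = (N+1)(N+2)/2$ for $r = 2$ includes the constant (threshold) term alongside the linear and quadratic monomials.

```python
from itertools import combinations_with_replacement
from math import prod


def poly_expand(x, r):
    """Return all monomials of degree 1..r in the components of x.

    Indices may repeat, so for real-valued x no two terms are redundant.
    """
    terms = []
    for degree in range(1, r + 1):
        for idx in combinations_with_replacement(range(len(x)), degree):
            terms.append(prod(x[i] for i in idx))
    return terms


def phi_function(x, weights, w0, r):
    """Fixed polynomial map followed by a linear discriminant (Eq. 3)."""
    z = poly_expand(x, r)
    s = sum(w * zi for w, zi in zip(weights, z)) + w0
    return 1 if s >= 0 else -1


N = 4
x = [1, -1, 1, 1]
# Monomials of degree 1 and 2, plus one for the threshold term.
num_terms = len(poly_expand(x, 2)) + 1
print(num_terms, (N + 1) * (N + 2) // 2)
```

With $N = 2$ and a single nonzero weight on the $x_1 x_2$ term, `phi_function` realizes the exclusive-or of two bipolar inputs, a mapping that no linear discriminant function on $x$ alone can implement.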

The components of the vector $z$ are binary-valued if $x$ is binary. In this
case, the samples cannot be assumed to be in general position since there are
at most $N + 2$ binary vectors in $N$ dimensional space which lie in general
position. We will evaluate the effectiveness of higher order mappings in
producing representations $z(x)$ that are separable by the second layer of
weights by calculating the Hamming distance between $z$ vectors given the
Hamming distance between the corresponding $x$ vectors. We expect that if
the Hamming distance between two binary vectors is large then they are
easy to distinguish from one another.


2.2.1  Complete  polynomial  expansion  of  binary  vectors 

There are at most $2^N$ nonredundant terms in any polynomial expansion
(Eq. 4) of a binary vector $x$ in $N$ dimensions. First, we will consider the
following $N$-th order expansion (or equivalently bit production) for the
bipolar vectors $x$ in $N$ dimensional binary space $\{1, -1\}^N$:

$$z = z(x) = (1, x_1, \ldots, x_N, x_1 x_2, \ldots, x_1 x_2 \cdots x_N)^T. \qquad (6)$$

If we apply a linear discriminant function to the new vectors $z$, then the
capacity becomes $2^N$, which is equal to the total number of possible input
vectors [2]. In other words, this memory is capable of performing any
mapping of $N$ binary variables to any binary output vector $y$. Of course the
number of weights that are needed to implement this memory grows to $2^N$
times $N_0$, the number of bits at the output. In what follows we show that
in this extreme case the vectors $z$ become orthogonal to each other.

Theorem 2 If we expand binary vectors $x^m$ ($m = 1, 2, \ldots, 2^N$) in
$X_b = \{1, -1\}^N$ to $2^N$-dimensional binary vectors $z^m$ according to
Eq. 6, where $N$ is the dimensionality of the original feature vectors, then

(a) $\langle z^{m_1}, z^{m_2} \rangle = 2^N \delta_{m_1 m_2}$, where $\langle \cdot, \cdot \rangle$ is an inner product,

(b) $\sum_i z_i^m = 0$,

(c) $\sum_m z_i^m z_j^m = 2^N \delta_{ij}$ and $\sum_m z_i^m = 0$.

Example: Table 1 shows the case $N = 3$. Note the orthogonality of the new
vectors, and the equal numbers of 1's and -1's in every new vector except
the first and in every set of like components except the set of first
components.

Table  1. 
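The $N = 3$ case can be regenerated with a short sketch such as the one below. The helper name `full_expand` and the ordering of rows and columns are assumptions made for illustration; they need not match the layout of the original table. The expansion takes the product of the input bits over every subset of indices, with the empty subset giving the constant 1, as in Eq. 6.

```python
from itertools import combinations, product
from math import prod


def full_expand(x):
    """Products of the bits of x over every subset of indices (Eq. 6)."""
    n = len(x)
    return [prod(x[i] for i in idx)
            for k in range(n + 1)
            for idx in combinations(range(n), k)]


N = 3
X = list(product((1, -1), repeat=N))   # all 2^N bipolar inputs, (1, 1, 1) first
Z = [full_expand(x) for x in X]

# Inner products between all pairs of expanded vectors (Theorem 2a).
gram = [[sum(a * b for a, b in zip(zi, zj)) for zj in Z] for zi in Z]
# Sum of the components of each expanded vector (Theorem 2b).
row_sums = [sum(z) for z in Z]
# Sum of each component over all expanded vectors (Theorem 2c).
col_sums = [sum(z[i] for z in Z) for i in range(2 ** N)]

for x, z in zip(X, Z):
    print(x, z)
```

In this sketch the only nonzero row sum belongs to the expansion of $(1, 1, 1)$ and the only nonzero column sum belongs to the constant first component, which are the exceptions noted in the example.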


Proof: (a) Let us consider any two different binary vectors in the binary
space $\{1, -1\}^N$ whose Hamming distance is $n$ ($1 \le n \le N$). When they
are expanded to two $2^N$-dimensional binary vectors, the number of $k$-th
order terms that have opposite sign in the two expansions is

$$\binom{n}{1}\binom{N-n}{k-1} + \binom{n}{3}\binom{N-n}{k-3} + \binom{n}{5}\binom{N-n}{k-5} + \cdots \qquad (7)$$

Notice that two corresponding product terms have different values if, and
only if, they contain an odd number of factors whose signs are opposite. The
Hamming distance between the two fully expanded (up to order $N$) vectors can
be calculated by adding the number of terms that have different sign over all
the orders of the expansion:

$$\sum_{k=1}^{N}\left[\binom{n}{1}\binom{N-n}{k-1} + \binom{n}{3}\binom{N-n}{k-3} + \cdots\right] = \left[\binom{n}{1} + \binom{n}{3} + \binom{n}{5} + \cdots\right]\sum_{j=0}^{N-n}\binom{N-n}{j} = 2^{n-1}\,2^{N-n} = 2^{N-1} \qquad (8)$$

The fact that the Hamming distance is $2^{N-1}$ for any two expanded vectors
(for any $n$) proves that all of the $2^N$ vectors become orthogonal and that
$\langle z^{m_1}, z^{m_2} \rangle = 2^N \delta_{m_1 m_2}$.

(b) Consider the cases where one of the two vectors is $(1, 1, \ldots, 1)$.
Then all the other vectors $z$ have equal numbers of 1's and -1's because
their Hamming distances are all $2^{N-1}$ from the $(1, 1, \ldots, 1)$ vector.

(c) See reference [13], page 109.
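The counting argument in part (a) can be checked numerically. The sketch below, with hypothetical function names, evaluates the per-order count of opposite-sign terms and sums it over all orders; under the stated argument the total should equal $2^{N-1}$ for every Hamming distance $n \ge 1$.

```python
from math import comb


def opposite_terms(N, n, k):
    """Number of k-th order terms containing an odd number of the n differing bits."""
    return sum(comb(n, i) * comb(N - n, k - i)
               for i in range(1, min(n, k) + 1, 2))


def expanded_hamming(N, n):
    """Hamming distance between two fully expanded vectors, summed over all orders."""
    return sum(opposite_terms(N, n, k) for k in range(1, N + 1))


for N in range(1, 9):
    for n in range(1, N + 1):
        assert expanded_hamming(N, n) == 2 ** (N - 1)
```

Keeping `opposite_terms` separate from the sum makes it possible to inspect how much each order of the expansion contributes to the separation, which is the use the text makes of this proof for quadratic and cubic memories.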


Slepian  has  discussed  this  orthogonalization  property  as  a  method  for 
designing  orthogonal  codes  and  has  given  a  different  proof  for  it [18].  The 
proof  presented  here  is  useful  for  characterizing  higher  order  memories  be¬ 
cause  it  allows  us  to  trace  the  contribution  of  each  order  of  the  expansion  to 
the  orthogonalization  and  immediately  derive  results  about  the  properties 
of  quadratic  and  cubic  memories.  The  output  vector  y  is 

$$y_l = \mathrm{sgn}\{W_l \cdot z\} = \mathrm{sgn}\Big\{\sum_{i=1}^{2^N} w_{li} z_i\Big\} \qquad (9)$$

where $l = 1, \ldots, N_0$ and $W_l$ is a $2^N$-dimensional weighting row
vector. The matrix $w_{li}$ that can implement the $x^m \mapsto y^m$ mapping
for $m = 1$ to $2^N$ can be formed in this case simply as the sum of outer
products of $y^m$ and $z^m$:

$$w_{li} = \sum_{m=1}^{2^N} y_l^m z_i^m \qquad (10)$$
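A small sketch of this construction follows, assuming the full expansion takes products over every subset of input bits. The names `full_expand`, `outer_product_weights` and `recall`, the sizes $N = 3$ and $N_0 = 2$, and the randomly drawn target outputs are all choices made for illustration. Because the expanded vectors are mutually orthogonal, the weighted sum for a stored input reduces to $2^N$ times the target bit, so recall should be exact for any assignment of outputs.

```python
from itertools import combinations, product
from math import prod
import random


def full_expand(x):
    """Products of the bits of x over every subset of indices."""
    n = len(x)
    return [prod(x[i] for i in idx)
            for k in range(n + 1)
            for idx in combinations(range(n), k)]


def outer_product_weights(X, Y):
    """Sum of outer products: w[l][i] = sum over m of y^m_l * z^m_i."""
    Z = [full_expand(x) for x in X]
    return [[sum(y[l] * z[i] for y, z in zip(Y, Z)) for i in range(len(Z[0]))]
            for l in range(len(Y[0]))]


def recall(W, x):
    """Threshold the weighted sums of the expanded input."""
    z = full_expand(x)
    return tuple(1 if sum(w * zi for w, zi in zip(row, z)) >= 0 else -1
                 for row in W)


N, N0 = 3, 2
rng = random.Random(0)
X = list(product((1, -1), repeat=N))
Y = [tuple(rng.choice((1, -1)) for _ in range(N0)) for _ in X]  # arbitrary mapping
W = outer_product_weights(X, Y)
all_ok = all(recall(W, x) == y for x, y in zip(X, Y))
print(all_ok)
```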

2.2.2  Expansions  of  a  single  order 

The  orthogonalization  property  of  the  full  expansion  is  interesting  because 
it  shows  that  higher  order  memories  provide  a  complete  framework  that 
takes  us  from  the  simplest  “neuron”,  the  linear  discriminant  function,  to 
the  full  capability  of  a  Boolean  look-up  table.  Higher  order  memories  can 
indeed  provide  a  valuable  tool  for  designing  digital  programmable  logic 
arrays. In this paper, however, we are interested in associative memories
that are capable of accepting inputs with large $N$ (e.g. if $N = 10^3$ then
$2^N \approx 10^{300}$), in which case considering a full expansion of the
input data is completely out of the question. In such cases we are really
interested
in  an  expansion  that  contains  a  large  enough  number  of  terms  to  provide 
the  capacity  needed  to  learn  the  problem  at  hand.  In  this  subsection  we 
analyze  the  properties  of  partial  expansions  that  include  all  the  terms  of 
one  order. 

We  will  first  consider  the  memory  consisting  of  all  the  terms  of  a  quadratic 
expansion  with  binary  input  vectors. 

$$y_l = \mathrm{sgn}\Big\{\sum_i \sum_j w_{lij} x_i x_j\Big\} \qquad (11)$$

The number of nonredundant terms in a quadratic expansion of a binary
vector is $L = N(N - 1)/2$. Let two input vectors have a Hamming distance
$n$. The angle between these two vectors is given by the relation
$\cos\theta_1 = 1 - (2n/N)$. The angle $\theta_2$ between the corresponding
$z(x)$ vectors can be readily calculated since we know their Hamming distance
from the proof of theorem 2(a):

$$\cos\theta_2 = 1 - \frac{4n(N - n)}{N(N - 1)} \approx 1 - 4p + 4p^2 = (1 - 2p)^2 \qquad (12)$$

where $p = n/N$. $\theta_2$ and $\theta_1$ are plotted versus $p$ in Figure 3a.
For $p < .5$, $\theta_2$ is always larger than $\theta_1$. Specifically, for
$p \ll 1$, $\theta_2 = \sqrt{2} \times \theta_1$. We see therefore that the
quadratic mapping not only expands the dimensionality, which provides
capacity, but also spreads the input samples apart, a generally desirable
property. For $p > .5$ the quadratically expanded vectors are closer to each
other than the original vectors, and in the extreme case $n = N$, $\theta_2$
becomes zero. This insensitivity of the quadratic mapping to a change in sign
of all the bits is a property that is shared by all even order expansions.
Next we consider a cubic memory


$$y_l = \mathrm{sgn}\Big\{\sum_i \sum_j \sum_k w_{lijk} x_i x_j x_k\Big\} = \mathrm{sgn}\Big\{\sum_{n=1}^{L} w'_{ln} z_n\Big\} \qquad (13)$$

where $L = \binom{N}{3} + N$. In Figure 3b we plot $\theta_3$, the angle
between two cubically expanded binary vectors, as a function of $p$. For
convenience, $\theta_1$ is also plotted in the same figure. In this case
$\theta_3$ increases faster with $p$ for $p < .5$. For $p \ll 1$,
$\theta_3 = \sqrt{3} \times \theta_1$. At $p \approx .4$ the cubic expansion
gives essentially perfectly orthogonal vectors, while for $p > .5$,
$\theta_3$ remains smaller than $\theta_1$, and in the limit $p = 1$,
$\theta_3 = \pi$. Thus the cubic memory discriminates between a vector and
its complement.

The basic trends that are evident in the quadratic and cubic memories
generalize to any order $r$. The number of independent terms in the $r$-th
order expansion of a binary vector is $\binom{N}{r}$, which is maximum for
$r \approx N/2$. Again this is not of practical importance because the number
of terms in a full expansion of this sort is prohibitively large. What is of
interest, however, is the effectiveness with which relatively small order
expansions can orthogonalize a set of input vectors. The angle $\theta_r$
between two vectors that have been expanded to the $r$-th order is given by
the following relation:

$$\cos\theta_r = 1 - 2\sum_{i\,\mathrm{odd}} \frac{\binom{n}{i}\binom{N-n}{r-i}}{\binom{N}{r}} \qquad (14)$$

We can obtain a simpler expression for the interesting case $r \ll N$, and
for small $p$, $\theta_r \approx \sqrt{r} \times \theta_1$.
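The quadratic and cubic angle relations discussed above can be checked on explicit expansions. The sketch below is illustrative only: the helper names are hypothetical, the cubic expansion is assumed to consist of the distinct-index triple products plus the $N$ linear terms (since $x_i^2 x_k = x_k$ for bipolar bits), and the sizes $N = 20$, $n = 4$ are arbitrary.

```python
from itertools import combinations


def quad_expand(x):
    """Nonredundant quadratic terms x_i x_j with i < j (N(N-1)/2 of them)."""
    return [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]


def cubic_expand(x):
    """Distinct-index cubic terms plus the N terms x_i^2 x_k = x_k."""
    triples = [x[i] * x[j] * x[k] for i, j, k in combinations(range(len(x)), 3)]
    return triples + list(x)


def cosine(a, b):
    """Cosine of the angle between two bipolar vectors of equal length."""
    return sum(p * q for p, q in zip(a, b)) / len(a)


N, n = 20, 4
x = [1] * N
y = [-1] * n + [1] * (N - n)      # Hamming distance n from x
x_bar = [-v for v in x]           # complement of x

cos1 = cosine(x, y)
cos2 = cosine(quad_expand(x), quad_expand(y))
cos2_formula = 1 - 4 * n * (N - n) / (N * (N - 1))
cos3 = cosine(cubic_expand(x), cubic_expand(y))
print(cos1, cos2, cos2_formula, cos3)
```

Evaluating the same cosines for `x` and `x_bar` shows the parity effect described in the text: the quadratic expansions of a vector and its complement coincide, while the cubic expansions are exactly opposite.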


Proposition 3 For $r \ll N$,

$$\cos\theta_r \approx (1 - 2p)^r. \qquad (15)$$

Moreover, for small $p$,

$$\theta_r \approx \sqrt{r}\,\theta_1 \qquad (16)$$

where $\theta_1 \approx 2\sqrt{p}$.


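Proof: For a small $r$, we can make the approximations
$\binom{N}{r} \approx N^r/r!$, $\binom{n}{i} \approx n^i/i!$, and
$\binom{N-n}{r-i} \approx (N - n)^{r-i}/(r - i)!$. Then $\cos\theta_r$ is
approximated as follows:

$$\cos\theta_r \approx 1 - 2\sum_{i\,\mathrm{odd}} \binom{r}{i} p^i (1 - p)^{r-i} = (1 - 2p)^r$$

because of these relationships:

$$\sum_{i\,\mathrm{odd}} \binom{r}{i} p^i (1-p)^{r-i} + \sum_{i\,\mathrm{even}} \binom{r}{i} p^i (1-p)^{r-i} = (1 - p + p)^r = 1,$$

$$\sum_{i\,\mathrm{odd}} \binom{r}{i} p^i (1-p)^{r-i} - \sum_{i\,\mathrm{even}} \binom{r}{i} p^i (1-p)^{r-i} = -(1 - p - p)^r = -(1 - 2p)^r.$$

When $p \ll 1$, $\cos\theta_r$, which is approximately $1 - \theta_r^2/2!$,
is approximated by $1 - 2rp$ directly from Eq. 14 or from Eq. 15. Therefore
$\theta_r \approx 2\sqrt{rp} = \sqrt{r}\,\theta_1$, which is Eq. 16.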

We plot $\theta_r$ versus $p$ for selected orders in Figure 4 using Eq. 15.
It is evident that increasing $r$ results in better separated feature
vectors. Polynomial mappings act as an effective mechanism for increasing the
dimensionality of the space in which inputs are classified because they
guarantee a very even distribution of the samples in this new space.
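To see how close the approximation is, the exact combinatorial expression for $\cos\theta_r$ can be compared with $(1 - 2p)^r$. The following sketch uses hypothetical function names and arbitrary sample values ($N = 1000$, $n = 100$, $r = 3$); it assumes the exact expression counts distinct-index terms of order $r$ only.

```python
from math import acos, comb, sqrt


def cos_theta_exact(N, n, r):
    """Exact cosine for the r-th order expansion: 1 - 2 * (odd-count terms) / C(N, r)."""
    odd = sum(comb(n, i) * comb(N - n, r - i)
              for i in range(1, min(n, r) + 1, 2))
    return 1 - 2 * odd / comb(N, r)


def cos_theta_approx(p, r):
    """Approximation for r << N."""
    return (1 - 2 * p) ** r


N, n, r = 1000, 100, 3
exact = cos_theta_exact(N, n, r)
approx = cos_theta_approx(n / N, r)
print(exact, approx)

# Small-p behaviour: the ratio of angles should approach sqrt(r).
p_small, r_small = 1e-4, 4
ratio = acos(cos_theta_approx(p_small, r_small)) / acos(1 - 2 * p_small)
print(ratio, sqrt(r_small))
```

For $r = 2$ the exact expression reduces to the quadratic result $1 - 4n(N-n)/(N(N-1))$, which gives a quick consistency check between the general relation and the quadratic case.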

2.3  Training  of  higher  order  memories 

Once the initial polynomial mapping has been selected, the rest of the
system in a higher order memory is simply a linear discriminant function. As
such it can be trained by any of the existing methods for training linear
discriminant functions. For instance, the pseudoinverse [1,14,16] can be used
to calculate the set of weights that will map a set of $L$-dimensional
expanded vectors $z^m$ to the associated output vectors $y^m$. Alternatively,
error driven algorithms such as the perceptron or adaline can be used to
iteratively train the memory by repeatedly presenting the input vectors to
the system, monitoring the output to obtain an error signal, and modifying
the weights so as to gradually decrease the error. The relative ease with
which higher order memories can be trained is an important advantage of this
approach. A higher order memory is basically a multilayered network where the
first layer is selected a priori. In terms of capacity alone, there is no
advantage whatsoever in having multiple layers with modifiable weights. From
theorem 1 we know that at best the capacity is determined by the number of
modifiable weights. For a higher order memory we get the full advantage of
the available degrees of freedom, whereas if we put the same number of
weights in multiple layers the resulting degeneracies will decrease the
capacity. The relative advantage of trainable multiple layers is the
potential for generalization that emerges through the learning process. The
generalization properties of higher order memories, on the other hand, are
mostly determined by the choice of the terms used in the polynomial expansion
in the fixed first layer. Thus the generalization properties of these
memories as described in this paper are imposed a priori by the designer
of the system.
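As a concrete illustration of training only the second layer, the sketch below applies the standard perceptron rule to a fixed quadratic expansion and learns the exclusive-or of two bipolar bits. The feature map (linear terms plus pairwise products), the function names, and the choice of problem are assumptions made for this example rather than details from the text.

```python
from itertools import combinations, product


def quad_features(x):
    """Fixed first layer: linear terms plus pairwise products x_i x_j (i < j)."""
    return list(x) + [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]


def predict(w, w0, x):
    """Linear discriminant applied to the expanded input."""
    z = quad_features(x)
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) + w0 > 0 else -1


def train_perceptron(samples, targets, epochs=100):
    """Perceptron rule: add t * z to the weights whenever a sample is misclassified."""
    w = [0.0] * len(quad_features(samples[0]))
    w0 = 0.0
    for _ in range(epochs):
        errors = 0
        for x, t in zip(samples, targets):
            if predict(w, w0, x) != t:
                z = quad_features(x)
                w = [wi + t * zi for wi, zi in zip(w, z)]
                w0 += t
                errors += 1
        if errors == 0:
            break
    return w, w0


X = list(product((1, -1), repeat=2))
T = [-x[0] * x[1] for x in X]          # +1 when the two bits differ (XOR)
w, w0 = train_perceptron(X, T)
print(w, w0, [predict(w, w0, x) for x in X])
```

The same loop would never terminate without errors on the raw two-bit inputs, since exclusive-or is not linearly separable there; the fixed product term is what makes the problem learnable by a single trainable layer.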

The  sum  of  outer  products  algorithm  that  has  been  used  extensively  for 
training  linear  associative  memories  can  also  be  used  for  training  the  higher 
order  memories  and  this  algorithm  generalizes  to  the  higher  order  case  in 
particularly  interesting  ways.  In  addition  this  particular  learning  algorithm 
is  predominantly  used  for  the  holographic  optical  implementations  that  are 
described  in  the  following  section.  Therefore  we  will  discuss  in  some  detail 
the  properties  of  higher  order  memories  that  are  trained  using  this  rule. 

2.3.1  The  outer  product  rule 

Let us consider associative memories constructed as an expansion of the
$r$-th order only, with input samples in an $N$ dimensional binary space and
$r \ge 1$. Then

$$y_l = \mathrm{sgn}\Big\{\sum_{j_1 j_2 \cdots j_r} w_{l j_1 j_2 \cdots j_r}\, x_{j_1} x_{j_2} \cdots x_{j_r}\Big\} \qquad (17)$$

where $1 \le j_1, \ldots, j_r \le N$ and $1 \le l \le N_0$. The number of
independent terms $L$ in the $r$-th order expansion is $\binom{N+r-1}{r}$,
which for $r \ll N$ can be approximated by $N^r/r!$.

The expression for the weights of the $r$-th order expansion using the
sum of outer products algorithm [2,3] is

$$w_{l j_1 j_2 \cdots j_r} = \sum_{m=1}^{M} y_l^m\, x_{j_1}^m x_{j_2}^m \cdots x_{j_r}^m \qquad (18)$$

where $M$ is the number of vectors stored in the memory and $y^m$ is an
output vector associated with an input vector $x^m$ as before. With the above
expression for the weight tensor, Eq. 17 can be rewritten as follows:

$$y_l = \mathrm{sgn}\Big\{\sum_{m=1}^{M} y_l^m \Big(\sum_{j=1}^{N} x_j^m x_j\Big)^r + w_{l0}\Big\}. \qquad (19)$$


The above equation suggests an alternate implementation for higher order
memories that are trained using the outer product rule. This is shown
schematically in Fig. 5. The inner products between the input vector and
all the stored vectors $x^m$ are formed first, then raised to the $r$-th
power, and the signal from the $m$-th unit is connected to the output through
interconnective weights $y_l^m$. If $y^m = x^m$ then the memory is
autoassociative, and in this case the output can be fed back to the input,
resulting in a system whose stable states are programmed to be the vectors
$x^m$. This becomes a direct extension of the Hopfield network [15,19,20] to
the higher order case. Assuming that $x = x^n$ is one of the stored vectors,
$y_l$ becomes

$$y_l = \mathrm{sgn}\Big\{N^r y_l^n + \sum_{m \neq n} y_l^m \Big(\sum_{j=1}^{N} x_j^m x_j^n\Big)^r\Big\} = \mathrm{sgn}\{N^r y_l^n + n_l(x^n)\} \qquad (20)$$

where the first term is the desired signal term, $n_l$ is a noise term, and
the thresholding weight is set to zero.
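The equivalence between the weight-tensor form and the inner-product form can be confirmed on a small example. In the sketch below the function names, the sizes ($N = 5$, $M = 4$, $r = 3$), the random stored vectors, and the restriction to a single output bit per stored vector are assumptions made for illustration; the threshold weight is taken to be zero.

```python
from itertools import product
from math import prod
import random


def tensor_sum(stored_x, stored_y, x, r):
    """Build each outer-product weight explicitly, then sum over all index tuples."""
    total = 0
    for idx in product(range(len(x)), repeat=r):
        w = sum(y * prod(xm[j] for j in idx)
                for xm, y in zip(stored_x, stored_y))
        total += w * prod(x[j] for j in idx)
    return total


def inner_product_sum(stored_x, stored_y, x, r):
    """Inner products with the stored vectors, raised to the r-th power and weighted."""
    return sum(y * sum(a * b for a, b in zip(xm, x)) ** r
               for xm, y in zip(stored_x, stored_y))


rng = random.Random(0)
N, M, r = 5, 4, 3
stored_x = [[rng.choice((1, -1)) for _ in range(N)] for _ in range(M)]
stored_y = [rng.choice((1, -1)) for _ in range(M)]   # one output bit per stored vector
probe = [rng.choice((1, -1)) for _ in range(N)]

a = tensor_sum(stored_x, stored_y, probe, r)
b = inner_product_sum(stored_x, stored_y, probe, r)
print(a, b)
```

The tensor form touches $N^r$ weights per output bit, while the inner-product form needs only $M$ inner products of length $N$, which is the practical appeal of the alternate implementation.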

The expectation value of $n_l(x^n)$ is zero if the bits that comprise the
stored binary input and output vectors are drawn randomly and independently,
having equal probability of being +1 or -1. If this is the case then

$$E(x_j^m x_{j'}^{m'}) = \delta_{jj'}\delta_{mm'}, \qquad E(y_l^m y_{l'}^{m'}) = \delta_{ll'}\delta_{mm'} \qquad (21)$$

where $\delta_{ij}$ is the Kronecker delta function. The variance of $n_l$ is
calculated using the facts that different stored vectors are uncorrelated
(i.e. $E(x_j^m x_j^{m'}) = 0$ for $m \neq m'$) and $(y_l^m)^2 = 1$. The
variance then becomes $(M - 1)Q(N, r)$, where $Q(N, r)$ is the number of
possible permutations such that

$$x_{t_1} x_{t_2} \cdots x_{t_{2r}} = 1 \qquad (23)$$

where the set of variables $\{t_1, \ldots, t_r, \ldots, t_{2r}\}$ spans all the
combinations produced by the set of variables
$\{j_1, \ldots, j_r, s_1, \ldots, s_r\}$. The variance can be calculated
exactly for the cases $r = 1, 2$ and 3, and it is $(M - 1)N$,
$(M - 1)(3N^2 - 2N)$ and $(M - 1)(15N^3 - 30N^2 + 16N)$, respectively. For the
general case we will derive lower and upper bounds which for large $N$
provide us with a good estimate of the variance for any order $r$.

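Proposition 4 The total number of permutations, $Q(N, r)$, for which Eq.
23 holds, satisfies the following relationship:

$$P(N, r)\frac{(2r)!}{2^r r!} + \binom{2r}{4} P(N, r-1)\frac{(2r-4)!}{2^{r-2}(r-2)!} \le Q(N, r) \le N^r \frac{(2r)!}{2^r r!} \qquad (24)$$

where $P(m, n) = m!/(m - n)!$.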

Proof: The number of ways of making $r$ pairs of $2r$ items is
$(2r - 1)(2r - 3) \cdots (3)(1) = (2r)!/(2^r r!)$. The items that we are
concerned with are the variables $t_1, \ldots, t_{2r}$, and each of these
variables can take one of $N$ values. We can only select the values of half
of these variables ($N^r$ possibilities), and for each of these choices we can
create $r$ pairs. Hence the upper bound is $N^r (2r)!/(2^r r!)$. This is an
upper bound because we have overcounted for different pairings of variables
that have the same value.

The initial lower bound is derived if each pair has a different value from
all others, which eliminates the possibility of overcounting. The number of
possible ways to satisfy Eq. 23 with the variables in any two pairs not
taking the same values is $P(N, r)(2r)!/(2^r r!)$. This is an underestimate
because all pairs that contain variables taking the same value should be
counted once. We can thus improve the lower bound by counting the number of
ways these degenerate pairings occur and adding them into the previous bound.
For example, when two pairs out of $r$ have the same values, with
$\binom{2r}{4}$ choices, there are
$\binom{2r}{4} N P(N - 1, r - 2)(2r - 4)!/(2^{r-2}(r - 2)!)$ possible
permutations, where $(2r - 4)!/(2^{r-2}(r - 2)!)$ is the number of ways of
making $r - 2$ pairs of $2r - 4$ items. Therefore, $Q(N, r)$ is lower bounded
by $P(N, r)(2r)!/(2^r r!) + \binom{2r}{4} P(N, r - 1)(2r - 4)!/(2^{r-2}(r - 2)!)$,
since $N P(N - 1, r - 2) = P(N, r - 1)$.
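For small $N$ and $r$ the count $Q(N, r)$ can be obtained by brute force and compared with the bounds derived in the proof. The sketch below interprets $Q(N, r)$ as the number of index tuples of length $2r$ in which every index value occurs an even number of times, which is the condition under which the product of the corresponding bipolar bits equals 1 for every vector; this interpretation and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product
from math import comb, factorial, perm


def q_bruteforce(N, r):
    """Count 2r-tuples over N index values in which every value appears an even number of times."""
    return sum(1 for t in product(range(N), repeat=2 * r)
               if all(c % 2 == 0 for c in Counter(t).values()))


def pairings(r):
    """Number of ways to split 2r items into r unordered pairs."""
    return factorial(2 * r) // (2 ** r * factorial(r))


def bounds(N, r):
    """Lower and upper bounds on Q(N, r) as derived in the proof."""
    upper = N ** r * pairings(r)
    lower = perm(N, r) * pairings(r)
    if r >= 2:
        lower += comb(2 * r, 4) * perm(N, r - 1) * pairings(r - 2)
    return lower, upper


for N, r in [(4, 1), (4, 2), (4, 3), (5, 2)]:
    q = q_bruteforce(N, r)
    lo, hi = bounds(N, r)
    print(N, r, lo, q, hi)
```

For $r = 2$ the improved lower bound coincides with the exact count $3N^2 - 2N$, and for $r = 3$ the brute-force count agrees with $15N^3 - 30N^2 + 16N$ on the sizes tried here.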

We can get a very good approximation to the SNR using the approximations
$M - 1 \approx M$ and $Q(N, r) \approx N^r (2r)!/(2^r r!)$, which are very
nearly true for the interesting case $r \ll N$:

$$\mathrm{SNR} = \frac{N^r}{\{M N^r (2r)!/(2^r r!)\}^{1/2}} = \left\{\frac{N^r\, 2^r\, r!}{M\,(2r)!}\right\}^{1/2}. \qquad (25)$$

For example, the linear memory, $r = 1$, has a SNR $\approx (N/M)^{1/2}$, the
quadratic memory, $r = 2$, a SNR of $N/(3M)^{1/2}$, and the cubic memory,
$r = 3$, a SNR of $(N^3/15M)^{1/2}$. We can obtain an estimate for the
capacity of an $r$-th order memory by equating the signal to noise ratios of
the linear and $r$-th order memories and solving for $M_r$, the number of
stored vectors that will yield the equality. For $r$ small compared to $N$ we
obtain

$$M_r = \frac{2^r r!}{(2r)!} N^{r-1} M_1. \qquad (26)$$

Comparing its value with the capacity $M_1$ of a linear memory we can obtain
the relationship between the capacities, that is,
$M_r/M_1 = N^{r-1} 2^r r!/(2r)!$. For example, $M_2$ of a quadratic memory is
$M_1 N/3$ and $M_3$ of a cubic memory is $M_1 N^2/15$.
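The signal to noise expressions above are easy to tabulate. The following sketch, with hypothetical function names and arbitrary sample values of $N$ and $M$, evaluates the approximate SNR for a given order and the corresponding ratio of capacities relative to the linear memory.

```python
from math import factorial, sqrt


def snr(N, M, r):
    """Approximate SNR of an r-th order outer-product memory for r << N."""
    return sqrt(N ** r * 2 ** r * factorial(r) / (M * factorial(2 * r)))


def capacity_ratio(N, r):
    """Approximate ratio M_r / M_1 obtained by equating the SNR to that of the linear memory."""
    return N ** (r - 1) * 2 ** r * factorial(r) / factorial(2 * r)


N, M = 100, 10
for r in (1, 2, 3):
    print(r, snr(N, M, r), capacity_ratio(N, r))
```

Printing the two columns side by side makes the trade-off visible: each additional order multiplies the estimated capacity by a factor on the order of $N$, at the cost of roughly $N$ times as many weights.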

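The diagonal terms in a high order memory $w_{l j_1 j_2 \cdots j_r}$ can be
defined as those for which the indices $j$ are not all different. We form the
weight tensor with zero diagonal as follows:

$$w_{l j_1 j_2 \cdots j_r} = \begin{cases} \sum_m y_l^m\, x_{j_1}^m x_{j_2}^m \cdots x_{j_r}^m & \text{if the } j\text{'s are all different,} \\ 0 & \text{otherwise.} \end{cases} \qquad (27)$$

When the input is one of the stored vectors $x^n$ and the weight tensor has
zero diagonal, the output $y_l$ becomes

$$y_l = \mathrm{sgn}\Big\{\sum_{\mathrm{different}\ j} w_{l j_1 j_2 \cdots j_r}\, x_{j_1}^n x_{j_2}^n \cdots x_{j_r}^n + w_{l0}\Big\}$$

$$\phantom{y_l} = \mathrm{sgn}\Big\{P(N, r)\, y_l^n + \sum_{m \neq n} y_l^m \sum_{\mathrm{different}\ j} x_{j_1}^m x_{j_1}^n \cdots x_{j_r}^m x_{j_r}^n + w_{l0}\Big\} \qquad (28)$$

where the first term is a signal term and the second a noise term as before.
The variance of the noise term is easily shown to be $(M - 1)P(N, r)\,r!$
using Eq. 21. Therefore, the SNR becomes

$$\mathrm{SNR} = \left\{\frac{P(N, r)}{(M - 1)\, r!}\right\}^{1/2} \qquad (29)$$

which can be approximated as $(N^r/M r!)^{1/2}$ for $r \ll N$.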

Chen and his coworkers [3] introduced an energy function [15,21] for the
$r$-th order autoassociative memory with feedback and outer products as
follows:

$$E_r = -\sum_{m=1}^{M} \langle x^m, x \rangle^{r+1} \qquad (30)$$

where $\langle \cdot, \cdot \rangle$ denotes an inner product of two vectors.
The change in the energy due to a change $\delta x$ in the state of the
network was shown by Chen et al. to be nonpositive for odd $r$:

$$\Delta E_r = E_r(x + \delta x) - E_r(x) = -(r + 1)\sum_i \delta x_i \sum_{j_1 \cdots j_r} w_{i j_1 j_2 \cdots j_r}\, x_{j_1} x_{j_2} \cdots x_{j_r} - R_r \qquad (31)$$

where

$$R_r = \sum_{m=1}^{M} \sum_{j=2}^{r+1} \binom{r+1}{j} \langle x^m, x \rangle^{r+1-j} \langle x^m, \delta x \rangle^{j}. \qquad (32)$$

The first term in Eq. 31 is always nonpositive because of the specification
of the update rule: $\delta x_i \ge 0$ if
$\sum_{j_1 \cdots j_r} w_{i j_1 j_2 \cdots j_r}\, x_{j_1} x_{j_2} \cdots x_{j_r} \ge 0$
and vice versa. They showed that the second term is also nonpositive by
showing that $R_r$ is an increasing function of $r$ for $r$ odd and
$R_1 \ge 0$.

For $r$ even it is possible to prove that the autoassociative memory
converges only for asynchronous updating, even though in simulations even
order autoassociative memories consistently converge as well. The fact that
the energy is not always decreasing when $r$ is even may actually be helpful
for getting out of local minima and settling in the programmed stable states,
which are global minima in a region of the energy surface. A descent
procedure that is always decreasing in energy cannot escape local minima
since there is no mechanism for climbing out of them. As an example, consider
a quadratic memory, i.e. $r = 2$ (even), whose energy function is given by

$$E_2 = -\sum_{ijk} w_{ijk}\, x_i x_j x_k \qquad (33)$$

$$\Delta E_2 = -3\sum_{ijk} w_{ijk}\, x_j x_k\, \delta x_i - 3\sum_{ijk} w_{ijk}\, x_k\, \delta x_i\, \delta x_j - \sum_{ijk} w_{ijk}\, \delta x_i\, \delta x_j\, \delta x_k. \qquad (34)$$

The first term is nonincreasing but the second and third terms can be
increasing. If the vector $x$ is very close to one of the stored vectors
$x^n$ then the first term becomes dominant and the energy will be very likely
to be nonincreasing, causing the system to settle at $x = x^n$. If $x$ is not
close to any of the stored vectors, then all three terms in the above
equations are on the average comparable to each other, and since two of them
are not necessarily nonincreasing the energy function may be increasing and
it is possible to escape from local minima.
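The behaviour of the energy under asynchronous updating can be illustrated for an odd order, where the energy is expected never to increase. The sketch below is an assumption-laden toy: it uses $r = 3$, three mutually orthogonal stored vectors of length 16 built from bit patterns of the index, a probe that differs from the first stored vector in two positions, and an update rule that leaves a bit unchanged when its local field is zero. The names are hypothetical.

```python
def inner(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))


def energy(stored, x, r):
    """Energy of state x: minus the sum of inner products raised to the power r + 1."""
    return -sum(inner(xm, x) ** (r + 1) for xm in stored)


def async_recall(stored, x, r, sweeps=10):
    """Update one bit at a time toward the sign of its local field, recording the energy."""
    x = list(x)
    energies = [energy(stored, x, r)]
    for _ in range(sweeps):
        changed = False
        for i in range(len(x)):
            h = sum(xm[i] * inner(xm, x) ** r for xm in stored)
            new = x[i] if h == 0 else (1 if h > 0 else -1)
            if new != x[i]:
                x[i] = new
                changed = True
                energies.append(energy(stored, x, r))
        if not changed:
            break
    return x, energies


N, M, r = 16, 3, 3
# Stored vector k is +1 where bit k of the index is 0, and -1 elsewhere.
stored = [[1 if (i >> k) & 1 == 0 else -1 for i in range(N)] for k in range(M)]
probe = list(stored[0])
for i in (0, 5):
    probe[i] = -probe[i]           # corrupt two bits of the first stored vector

final, energies = async_recall(stored, probe, r)
print(final == stored[0], energies)
```

Changing `r` to an even value in the same harness is a simple way to look for the energy increases discussed above, although a toy of this size is not evidence about large networks.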


2.4  Optical implementations of quadratic associative memories


The outer product quadratic associative memories described in the previous
section require three basic components for their implementation:
interconnective weights, a square-law device, and a threshold nonlinearity.
In this section, we present a variety of optical implementations using either
planar or volume holograms to provide the interconnection pathways and
optical or electro-optical devices to provide the required nonlinearities.

Since holographic techniques are used to implement the required
interconnections, we will first briefly discuss holography [22] and in
particular the distinction between the use of planar versus volume holograms.
The holographic process is shown schematically in Fig. 6. In the recording
step (Fig. 6a) the interference between the reference plane wave, which is
created by collimating the light from a point source using a lens, and the
wave originating from the object "A" is recorded on a planar light sensitive
medium such as a photographic plate. When the developed plate is illuminated
with the same reference wave, the field that is diffracted by the recorded
interference pattern gives a virtual image of the original object which can
be converted to a real image with a lens. The reconstruction of the hologram
is thus equivalent to interconnecting the single point from which the plane
wave reference is derived to all the points that comprise the reconstructed
image. The weight of each interconnection is specified by the interference
pattern stored in the hologram.

Volume holograms are prepared and used in the same manner, except
that whereas a planar hologram records the interference pattern as a two
dimensional pattern on a plane, a volume hologram records the interference
pattern throughout the volume of a three dimensional medium. The disparity
in the dimensionalities of the two storage formats results in marked
differences in the capabilities of the two processes. This difference is
explained with the aid of Figs. 7a and 7b, where the reconstructions of both
a planar and a volume hologram are shown. Each hologram is prepared
to store the two images "A" and "O" by double exposure, with each image
being associated with a reference plane wave that is incident on the
hologram at a different angle. Each reference plane wave is generated by
a separate point source, and thus the reconstruction of a hologram with the
two reference waves is equivalent to interconnecting multiple input points
to all the points on the plane of the reconstructed image. In the case of the
planar hologram, however, when either one of the reference waves is
incident, both images are reconstructed. This implies that we cannot in this
case independently specify how each of the input points is connected to the
output. In contrast, because of the interaction of the fields in the third
dimension [23], the volume hologram is able to resolve the differences in the
angle of incidence of the reference beam, and upon reconstruction, when the
reference for "A" illuminates the medium, only "A" is reconstructed, and
similarly for the second pattern. When both input points are on
simultaneously, each is interconnected to the output independently according
to the way it was specified by the recording of the two holograms. Thus
volume holograms provide more flexibility for implementing arbitrary
interconnections, which translates to efficient three dimensional storage of
the interconnective weights needed to specify the quadratic memory.

Another way in which we can draw the distinction between planar and
volume holograms is in terms of the degrees of freedom. The implementation
of a quadratic memory whose input word size is $N$ bits requires
approximately $N^3$ interconnections for the three dimensional
interconnection tensor. The number of degrees of freedom of the planar
hologram of area $A$ is upper bounded by $A/\delta^2$, while that of a volume
hologram is limited to $V/\delta^3$, where $V$ is the volume of the crystal
and $\delta$ is the minimum detail that can be recorded in any one dimension
[24,25,26]. Equating the degrees of freedom that are required to do the job
to those that are available, the crystal volume is determined to be at least
$V = N^3\delta^3$, whereas a planar hologram to do the same job would require
a hologram of area $A = N^3\delta^2$. For comparison, a network with
$N = 10^3$ can in principle be implemented using a cubic crystal with the
length of each side being $l_v = N\delta = 1$ cm, but a square planar
hologram is required to have the length of each side be at least
$l_p = N^{3/2}\delta \approx 0.32$ m at $\delta = 10\ \mu$m. Thus, the volume
hologram offers a more compact means of implementing large memory systems.
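The size comparison above follows directly from the two degrees-of-freedom bounds, and can be reproduced with a few lines of arithmetic. In the sketch below the function names are hypothetical and lengths are in metres; the side of the cubic crystal is the cube root of $N^3\delta^3$ and the side of the square planar hologram is the square root of $N^3\delta^2$.

```python
def volume_side(N, delta):
    """Side of a cubic crystal holding N^3 weights at resolution delta."""
    return N * delta


def planar_side(N, delta):
    """Side of a square planar hologram holding N^3 weights at resolution delta."""
    return N ** 1.5 * delta


N = 10 ** 3
delta = 10e-6                      # 10 micrometres, expressed in metres
l_v = volume_side(N, delta)
l_p = planar_side(N, delta)
print(l_v, l_p, l_p / l_v)
```

The ratio of the two side lengths is $\sqrt{N}$, so the advantage of the volume hologram grows with the word size of the memory.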

2.4.1  Volume  hologram  systems 

There are several schemes for fully utilizing the interconnective capability
of volume holograms [25,26]. For the implementation of quadratic memories
we use volume holograms to fully interconnect a 2-D pattern to a 1-D
pattern ($N^2 \mapsto N$ mappings) and also the reverse ($N \mapsto N^2$). The
geometry for recording the weights for both cases is shown in Fig. 8a and the
reconstruction geometries are illustrated in Figs. 8b and 8c. The circles
represent the resolvable spots at the various planes in the system. The
waves emanating from each point at the input planes are transformed into
plane waves by the Fourier transform lenses $L_1$ and $L_2$ and interfere
within the crystal, creating volume gratings.

The weights are loaded into the volume hologram with multiple holographic
exposures in the system of Fig. 8a. In the following subsections
we will describe several specific procedures for doing so. For the
$N \mapsto N^2$ mapping (Fig. 8b), in reading out the stored information, a
single source in the input array reconstructs the one of the $N$ 2-D images,
each consisting of $N^2$ pixels, that it is associated with. The rest of the
images, which belong to the other input points, are not read out because of
the angular discrimination of volume holograms. The counterpart to this
scheme, shown in Fig. 8c, implements an arbitrary $N^2 \mapsto N$ mapping.
This setup is basically the same as that of Fig. 8b except that the roles of
the input planes have been interchanged or, equivalently, the direction in
which light propagates has been reversed.

2.4.1a  $N^2 \mapsto N$ schemes

First, we consider a method by which the full three dimensional
interconnection tensor is implemented directly with a volume hologram. Recall
that if the weight tensor is trained using the sum of outer products then it
is given by

$$w_{ijk} = \sum_{m=1}^{M} y_i^m x_j^m x_k^m \qquad (35)$$

where $x_i^m$ represents the $m$-th input memory vector and $y_i^m$ represents
the associated output vector. Such a memory is accessed by first creating an
outer product of the input vector and multiplying it with $w_{ijk}$ as
follows:

$$y_i = \mathrm{sgn}\Big\{\sum_{j=1}^{N} \sum_{k=1}^{N} w_{ijk}\, x_j x_k\Big\}. \qquad (36)$$

The volume hologram is prepared using the set-up in Fig. 8a. First, the outer product matrix of the m-th memory input vector, $x_j^m x_k^m$, is formed on an electronically addressed spatial light modulator (SLM) [27]. Another one dimensional SLM whose transmittance represents the m-th output vector $y_i^m$ is placed in the other input plane, and the two SLMs are illuminated by coherent light. The transmitted waves are then Fourier transformed by lenses $L_1$ and $L_2$ to interfere within the crystal volume to create index gratings. This procedure is repeated for all $M$ associated input-output pairs so that a sum of $M$ holograms is created in the crystal. For the quadratic outer product memory whose capacity is fully expended, this involves on the order of $N^2/\log N$ exposures.
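Equations 35 and 36 can be checked numerically with a small sketch. The Python below is an illustration only: the function names (`train_outer_product`, `recall`), the convention sgn(0) = +1, and the tiny orthogonal test patterns are assumptions made for this example, and nothing here models the optics.

```python
def train_outer_product(xs, ys):
    """Eq. 35: w[i][j][k] = sum over m of y_i^m * x_j^m * x_k^m."""
    n_in, n_out = len(xs[0]), len(ys[0])
    w = [[[0] * n_in for _ in range(n_in)] for _ in range(n_out)]
    for x, y in zip(xs, ys):          # one "exposure" per stored pair
        for i in range(n_out):
            for j in range(n_in):
                for k in range(n_in):
                    w[i][j][k] += y[i] * x[j] * x[k]
    return w


def sgn(v):
    # Assumed convention: sgn(0) = +1.
    return 1 if v >= 0 else -1


def recall(w, x):
    """Eq. 36: y_i = sgn(sum_j sum_k w[i][j][k] * x_j * x_k)."""
    n_in = len(x)
    return [
        sgn(sum(w_i[j][k] * x[j] * x[k] for j in range(n_in) for k in range(n_in)))
        for w_i in w
    ]


# Three mutually orthogonal bipolar inputs (hypothetical data) and their outputs.
xs = [[1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
ys = [[1, 1, -1, -1], [-1, 1, 1, -1], [-1, -1, -1, 1]]
w = train_outer_product(xs, ys)
recalled = [recall(w, x) for x in xs]
```

Because the chosen inputs are mutually orthogonal, the cross terms vanish and each stored input retrieves its own output exactly; for non-orthogonal memories the cross terms act as noise.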

We will now describe another method for recording the weights in the volume hologram that involves fewer exposures and can be used not only for the outer product scheme but for recording any given weight tensor as well. The same basic recording architecture of Fig. 8a is used in this case also. In the first exposure, the top light source in the linear array is turned on while the SLM is programmed with the matrix $w_{1jk}$, where $w_{ijk}$ is the interconnection tensor. When the SLM is illuminated with light coherent with that of the point source, the crystal records the mutual interference pattern as a hologram of the image $w_{1jk}$ with a reference beam that is the plane wave generated from the top light source. In the next step, the second source is turned on while the SLM is programmed with the matrix $w_{2jk}$. In this manner the connectivity for all the points in the linear array at the input is sequentially specified, and the memory training is completed when all $N$ exposures have been made. The disadvantage of this method relative to the outer product recording is the need to precalculate the weight tensor electronically, but it has the advantages of fewer exposures ($N$ versus $N^2/\log N$) and greater flexibility in choosing the training method.

The architecture in Fig. 8c is used to access the data stored in the hologram by either one of the recording methods described above. The electronically addressed 2-D SLM is placed at the input plane and it is programmed with the outer product matrix $x_j x_k$ of the input vector. The light from the $N^2$ input points is interconnected with the $N$ output points via the recorded $w_{ijk}$ interconnect kernel. A linear array of $N$ photodetectors is positioned to sample the output points.

It is important to restate at this juncture that this particular implementation achieves the quadratic interconnections by first transforming the $N$ input features (i.e., the $N$ elements of the input vector $x_j$) into a set of $N^2$ features via the outer product operation. The result is that although the interconnections are quadratic with respect to the $N$ original feature points, they are linear with respect to the $N^2$ transformed features. This allows the application of error driven learning algorithms for linear networks such as the Adaline [28], where the interconnections are developed by an iterative training process [29]. The operation of such a learning scheme is illustrated in Fig. 9, which is the same basic architecture as Fig. 8c with feedback from the output back into one of the input ports. Each iteration consists of a reading and a writing phase. During the reading phase, the interconnections present in the crystal are interrogated with a particular item to be memorized by illuminating the 2-D SLM, which contains the outer product matrix $x^m x^{mT}$, and the output is formed on the detector array. In the subsequent writing phase, the error pattern generated by subtracting the actual output from the desired output pattern is loaded into the 1-D SLM and both SLMs (the 2-D SLM still contains $x^m x^{mT}$) are illuminated with coherent light, forming a set of gratings in addition to the previously recorded gratings. The procedure is iteratively repeated for
each item to be memorized until the output error is sufficiently small. This algorithm is a descent procedure designed to minimize the mean squared cost $e = \frac{1}{2} \sum_{m=1}^{M} \big[ \sum_{j,k=1}^{N} w_{ijk} x_j^m x_k^m - y_i^m \big]^2$ by iteratively updating the interconnection values.
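The read/write iteration described above can be sketched in software as a least-mean-squares loop over the $N^2$ outer product features. This is a minimal sketch under stated assumptions, not a model of the optical system: the names `outer_features`, `train_lms` and `max_abs_error`, the step size `eta`, the number of sweeps, and the three sample patterns are all hypothetical choices made for illustration.

```python
def outer_features(x):
    """Flatten the outer product x x^T into N^2 features."""
    return [xj * xk for xj in x for xk in x]


def train_lms(xs, ys, eta, sweeps):
    """Error-driven (Adaline-style) learning on the N^2 outer product features."""
    n_feat, n_out = len(xs[0]) ** 2, len(ys[0])
    w = [[0.0] * n_feat for _ in range(n_out)]
    for _ in range(sweeps):
        for x, y in zip(xs, ys):
            phi = outer_features(x)
            # Reading phase: interrogate the current weights with this item.
            out = [sum(wp * fp for wp, fp in zip(w_i, phi)) for w_i in w]
            # Writing phase: add a correction proportional to (desired - actual).
            for i in range(n_out):
                err = y[i] - out[i]
                for p in range(n_feat):
                    w[i][p] += eta * err * phi[p]
    return w


def max_abs_error(w, xs, ys):
    """Largest remaining output error over all stored items."""
    worst = 0.0
    for x, y in zip(xs, ys):
        phi = outer_features(x)
        for w_i, target in zip(w, y):
            out = sum(wp * fp for wp, fp in zip(w_i, phi))
            worst = max(worst, abs(target - out))
    return worst


xs = [[1, 1, 1, 1], [1, 1, 1, -1], [1, -1, 1, -1]]   # hypothetical, not orthogonal
ys = [[1, -1], [-1, -1], [1, 1]]
w = train_lms(xs, ys, eta=1.0 / 16, sweeps=50)       # 16 = squared norm of the features
residual = max_abs_error(w, xs, ys)
```

The step size is set to the reciprocal of the squared feature norm so that each writing phase exactly corrects the item just read; repeated sweeps then remove the interference between non-orthogonal items.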


2.4.1b $N \mapsto N^2$ schemes


The $N \mapsto N^2$ mapping capability of the volume hologram, which is the inverse of that required for the architectures just described, can also be used to implement quadratic memories and can be generalized for higher order memories. The basic idea behind this scheme is illustrated in Fig. 10, which shows the interconnection between the i-th and j-th neurons whose weight $w_{ij}$ is a linear combination of all of the inputs and is described by

$$w_{ij} = \sum_{k=1}^{N} w_{ijk} x_k. \tag{37}$$

The overall result is, of course, recognized to be the equation describing the quadratic memory, but the notion of an input dependent weight suggests the implementation shown in Fig. 11. The system is basically an optical vector matrix multiplier [30] in which the matrix is created on an optically addressed SLM by multiplying the input vector with the three dimensional tensor stored in a volume hologram. The input vector is represented by a one dimensional array of light sources. The portion of the system on the left side of the SLM is the vector matrix multiplier and it works as follows. Light from each input point is imaged horizontally but spread out vertically so that each source illuminates a narrow, vertical area on the 2-D SLM. The reflectance of the SLM corresponds to the matrix of weights $w_{ij}$ in Eq. 37. The reflected light from the SLM travels back towards the input and a portion of it is reflected by a beam splitter and then imaged horizontally but focused vertically onto a 1-D output detector array. The output from the detector array represents the matrix vector product between the input vector and the matrix represented by the 2-D reflectance of the SLM. The matrix of weights, in this case, is not fixed but rather computed from the input via a volume hologram by exposing the right-hand side of the SLM as shown in the figure. The optical system to the right of the 2-D SLM in Fig. 11 is the same as the $N \mapsto N^2$ system of Fig. 8b. The volume hologram, which has been prepared to perform the appropriate dimension increasing operation ($N \mapsto N^2$), transforms the light distribution given by its one dimensional array of sources into the input dependent matrix of weights given by Eq. 37. This system is functionally equivalent to the previous system except that it does not require a 2-D electronically addressed input SLM. The 1-D devices utilized in this architecture are easier and faster to use in practice. Instead, a 2-D optically addressed SLM is needed, which in practice is simpler to use than electronically addressed devices (it requires less electronics), typically has more pixels, and potentially offers much higher speed. A disadvantage of this method, however, is that it does not lend itself to the direct implementation of the simple outer product training method without the use of an electronically addressed 2-D SLM.
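The functional equivalence claimed above (first form the input dependent matrix of Eq. 37, then perform a vector matrix multiplication) can be verified on a toy tensor. The sketch below is only an arithmetic check; the function names and the arbitrary integer tensor are assumptions introduced for this example.

```python
def input_dependent_weights(w, x):
    """Eq. 37: w_ij = sum_k w[i][j][k] * x_k (the matrix written onto the SLM)."""
    return [[sum(wk * xk for wk, xk in zip(w_ij, x)) for w_ij in w_i] for w_i in w]


def matrix_vector(wij, x):
    """Vector matrix multiplier stage: sum_j w_ij * x_j for every output i."""
    return [sum(a * b for a, b in zip(row, x)) for row in wij]


def quadratic_form(w, x):
    """Direct evaluation of sum_j sum_k w[i][j][k] * x_j * x_k."""
    n = len(x)
    return [sum(w_i[j][k] * x[j] * x[k] for j in range(n) for k in range(n)) for w_i in w]


n = 3
# Arbitrary integer tensor, used only to exercise the identity.
w = [[[(i + 1) * (j - k) + i * j + k for k in range(n)] for j in range(n)] for i in range(n)]
x = [1, -1, 1]
two_stage = matrix_vector(input_dependent_weights(w, x), x)
direct = quadratic_form(w, x)
```

Both routes give the same pre-threshold outputs for any tensor and any input, which is why the architecture of Fig. 11 can replace the 2-D input SLM with 1-D devices.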

The $N \mapsto N^2$ mapping technique can be used in conjunction with its inverse, the $N^2 \mapsto N$ mapping, to implement the quadratic outer product memory using two volume holograms, a 1-D electronically addressed SLM, and an optically addressed 2-D SLM. Shown in Fig. 12 is a schematic diagram of such a system. The first hologram is prepared with the multiple exposure scheme discussed earlier (Fig. 8a) where, for each exposure, a memory vector in the one dimensional input array and one point in the two dimensional ($\sqrt{M} \times \sqrt{M}$) input training array are turned on simultaneously. The second hologram is prepared by a similar procedure except that the associated output vectors are recorded in correspondence to each point in the two dimensional training plane. After the holograms are thus prepared, an input vector is loaded into the one dimensional input array and the correlations between it and the $M$ memory vectors are displayed in the output plane [31,32,33]. An optically addressed SLM can be used to produce an amplitude distribution which is the square of the incident correlation amplitudes. The processed light then illuminates the second hologram, which serves as an $M \mapsto N$ interconnection, each correlation peak in the SLM plane reading out its corresponding memory vector and forming a weighted sum of the stored memories on the one dimensional output detector array. This is a direct optical implementation of the system shown in block diagram form in Fig. 5, with the 2-D SLM performing the square law nonlinearity at the middle plane and the two volume holograms providing the interconnections to the input and output.
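The three stages just described (correlate against the stored memories, square, form a weighted sum of the stored outputs) compute the same quantity as the outer product tensor of Eqs. 35 and 36 before thresholding. The following sketch checks that identity; the function names and the sample memories and probe vector are hypothetical and serve only as an illustration.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def correlate_square_readout(xs, ys, x):
    """Three stages: correlate, square, weighted sum of stored outputs."""
    correlations = [dot(xm, x) for xm in xs]   # first hologram: inner products
    squared = [c * c for c in correlations]    # square-law device in the middle plane
    n_out = len(ys[0])
    # Second hologram: each squared correlation reads out its stored output vector.
    return [sum(s * ym[i] for s, ym in zip(squared, ys)) for i in range(n_out)]


def tensor_readout(xs, ys, x):
    """Same quantity computed from the outer product tensor, before thresholding."""
    n, n_out = len(x), len(ys[0])
    return [
        sum(ym[i] * xm[j] * xm[k] * x[j] * x[k]
            for xm, ym in zip(xs, ys) for j in range(n) for k in range(n))
        for i in range(n_out)
    ]


xs = [[1, 1, -1, 1], [1, -1, -1, -1], [-1, 1, 1, 1]]   # hypothetical memories
ys = [[1, -1], [-1, -1], [1, 1]]
probe = [1, 1, -1, -1]
staged = correlate_square_readout(xs, ys, probe)
direct = tensor_readout(xs, ys, probe)
```

The staged form needs only $M$ inner products and $M$ squarings per input, whereas the tensor form touches all $N^2$ products for each output.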

2.4.2  Planar  hologram  systems 

While not having the extra dimension to directly implement the three dimensional interconnection tensor for general quadratic memories, planar holograms can nevertheless implement the outer product quadratic memory in a way similar to the one used in the system just described. The planar holographic system is shown in Fig. 13. Here, the information is stored in the two multichannel 1-D Fourier transform (FT) holograms, the first of which contains the 1-D FTs of the $M$ memory input vectors and the other, the FTs of the associated output vectors [10]. The first part of the system is a multichannel correlator which correlates the input against each of the $M$ memory vectors. At the correlation plane, the $M$ correlation functions stacked up vertically are sampled at $x = 0$ with a slit to obtain the required inner products, which are then squared by the SLM. Each resulting point source of light is then collimated horizontally and imaged vertically onto the second hologram to illuminate that portion which contains the corresponding output vector. The final stage computes the FT of the light distribution just following the second hologram to produce the weighted sum of the vectors at the output detector array. It is interesting to note that if the SLM is removed from the correlation plane, this system reduces to the linear outer product memory.

Notice that in this system, if the input pattern shifts horizontally then the correlation peak also shifts in the correlation plane and it is blocked by the slit that is placed there. Therefore shifted versions of the input vector are not recognized, as expected. Shift invariance, where the shifted versions of the memory vectors are recognized and their associated outputs, shifted by the same amount as the input, are retrieved, can be built into this system by simply lengthening the input SLM and the output detector array to accommodate the shifts and removing the slit in the correlation plane. The resulting system treats each of the $2N - 1$ shifted versions of the memory vectors as a new memory and, as a result, the increased capacity of the quadratic memory over the linear one (by a factor of $N$) is expended to provide invariant operation.

Figure Captions


Fig. 1 a. Discriminant function. b. Associative memory constructed as an array of discriminant functions.

Fig. 2 Higher order associative memory.

Fig. 3 a. The angle between linearly and quadratically expanded vectors as a function of the Hamming distance at the input. b. The angle between linearly and cubically expanded vectors as a function of the Hamming distance at the input.

Fig. 4 The angle between expanded vectors for selected orders.

Fig. 5 Outer product, r-th order associative memory.

Fig. 6 Holographic recording and reconstruction.

Fig. 7 Holographic interconnections using a. planar versus b. volume holograms.

Fig. 8 Optical interconnections using volume holograms. a. Recording apparatus. b. $N \mapsto N^2$ mapping. c. $N^2 \mapsto N$ mapping.

Fig. 9 Optical system for performing error driven learning in a higher order memory.

Fig. 10 Quadratic mappings implemented as nonlinear interconnections.

Fig. 11 Optical architecture for the implementation of the nonlinear interconnections of Fig. 10.

Fig. 12 Optical higher order associative memory implemented with volume holograms.

Fig. 13 Optical implementation of the outer product higher order memory.

3 Connectivity versus Entropy

How does the connectivity of a neural network (number of synapses per neuron) relate to the complexity of the problems it can handle (measured by the entropy)? Switching theory would suggest no relation at all, since all Boolean functions can be implemented using a circuit with very low connectivity (e.g., using two-input NAND gates). However, for a network that learns a problem from examples using a local learning rule, we prove that the entropy of the problem becomes a lower bound for the connectivity of the network.

3.1  Introduction 

The most distinguishing feature of neural networks is their ability to spontaneously learn the desired function from 'training' samples, i.e., their ability to program themselves. Clearly, a given neural network cannot learn just any function; there must be some restrictions on which networks can learn which functions. One obvious restriction, which is independent of the learning aspect, is that the network must be big enough to accommodate the circuit complexity of the function it will eventually simulate. Are there restrictions that arise merely from the fact that the network is expected to learn the function, rather than being purposely designed for the function? This paper reports a restriction of this kind.

The result imposes a lower bound on the connectivity of the network (number of synapses per neuron). This lower bound can only be a consequence of the learning aspect, since switching theory provides purposely designed circuits of low connectivity (e.g., using only two-input NAND gates) capable of implementing any Boolean function [1,2]. It also follows that the learning mechanism must be restricted for this lower bound to hold; a powerful mechanism can be designed that will find one of the low-connectivity circuits (perhaps by exhaustive search), and hence the lower bound on connectivity cannot hold in general. Indeed, we restrict the learning mechanism to be local; when a training sample is loaded into the network, each neuron has access only to those bits carried by itself and the neurons it is directly connected to. This is a strong assumption that excludes sophisticated learning mechanisms used in neural-network models, but may be more plausible from a biological point of view.

The lower bound on the connectivity of the network is given in terms of the entropy of the environment that provides the training samples. Entropy is a quantitative measure of the disorder or randomness in an environment or, equivalently, the amount of information needed to specify the environment. There are many different ways to define entropy, and many technical variations of this concept [3]. In the next section, we shall introduce the formal definitions and results, but we start here with an informal exposition of the ideas involved.

The environment in our model produces patterns represented by $N$ bits $x = x_1 \cdots x_N$ (pixels in the picture of a visual scene if you will). Only $h$ different patterns can be generated by a given environment, where $h \le 2^N$ (the entropy is essentially $\log_2 h$). No knowledge is assumed about which patterns the environment is likely to generate, only that there are $h$ of them. In the learning process, a huge number of sample patterns are generated at random from the environment and input to the network, one bit per neuron. The network uses this information to set its internal parameters and gradually tune itself to this particular environment. Because of the network architecture, each neuron knows only its own bit and (at best) the bits of the neurons it is directly connected to by a synapse. Hence, the learning rules are local: a neuron does not have the benefit of the entire global pattern that is being learned.

After the learning process has taken place, each neuron is ready to perform a function defined by what it has learned. The collective interaction of the functions of the neurons is what defines the overall function of the network. The main result of this paper is that (roughly speaking) if the connectivity of the network is less than the entropy of the environment, the network cannot learn about the environment. The idea of the proof is to show that if the connectivity is small, the final function of each neuron is independent of the environment, and hence to conclude that the overall network has accumulated no information about the environment it is supposed to learn about.

3.2 Formal result

A neural network is an undirected graph (the vertices are the neurons and the edges are the synapses). Label the neurons $1, \ldots, N$ and define $K_n \subseteq \{1, \ldots, N\}$ to be the set of neurons connected by a synapse to neuron $n$, together with neuron $n$ itself. An environment is a subset $e \subseteq \{0,1\}^N$ (each $x \in e$ is a sample from the environment). During learning, $x_1, \ldots, x_N$ (the bits of $x$) are loaded into the neurons $1, \ldots, N$, respectively. Consider an arbitrary neuron $n$ and relabel everything to make $K_n$ become $\{1, \ldots, K\}$. Thus the neuron sees the first $K$ coordinates of each $x$.

Since our result is asymptotic in $N$, we will specify $K$ as a function of $N$: $K = \alpha N$ where $\alpha = \alpha(N)$ satisfies $\lim_{N \to \infty} \alpha(N) = \alpha_0$ ($0 < \alpha_0 < 1$). Since the result is also statistical, we will consider the ensemble of environments

$$\mathcal{E} = \mathcal{E}(N) = \big\{ e \subseteq \{0,1\}^N \;\big|\; |e| = h \big\}$$

where $h = 2^{\beta N}$ and $\beta = \beta(N)$ satisfies $\lim_{N \to \infty} \beta(N) = \beta_0$ ($0 < \beta_0 < 1$). The probability distribution on $\mathcal{E}$ is uniform; any environment $e \in \mathcal{E}$ is as likely to occur as any other.

The neuron sees only the first $K$ coordinates of each $x$ generated by the environment $e$. For each $e$, we define the function $n : \{0,1\}^K \mapsto \{0, 1, 2, \ldots\}$ where

$$n(a_1 \cdots a_K) = \big| \{ x \in e \mid x_k = a_k \text{ for } k = 1, \ldots, K \} \big|$$

and the normalized version

$$\nu(a_1 \cdots a_K) = \frac{n(a_1 \cdots a_K)}{h}.$$

The function $\nu$ describes the relative frequency of occurrence for each of the $2^K$ binary vectors $x_1 \cdots x_K$ as $x = x_1 \cdots x_N$ runs through all $h$ vectors in $e$. In other words, $\nu$ specifies the projection of $e$ as seen by the neuron. Clearly, $\nu(a) \ge 0$ for all $a \in \{0,1\}^K$ and $\sum_{a \in \{0,1\}^K} \nu(a) = 1$.

Corresponding to two environments $e_1$ and $e_2$, we will have two functions $\nu_1$ and $\nu_2$. If $\nu_1$ is not distinguishable from $\nu_2$, the neuron cannot tell the difference between $e_1$ and $e_2$. The distinguishability between $\nu_1$ and $\nu_2$ can be measured by

$$d(\nu_1, \nu_2) = \frac{1}{2} \sum_{a \in \{0,1\}^K} |\nu_1(a) - \nu_2(a)|.$$

The range of $d(\nu_1, \nu_2)$ is $0 \le d(\nu_1, \nu_2) \le 1$, where '0' corresponds to complete indistinguishability while '1' corresponds to maximum distinguishability. We are now in a position to state the main result.

Let $e_1$ and $e_2$ be independently selected environments from $\mathcal{E}$ according to the uniform probability distribution. $d(\nu_1, \nu_2)$ is now a random variable, and we are interested in the expected value $E(d(\nu_1, \nu_2))$. The case where $E(d(\nu_1, \nu_2)) = 0$ corresponds to the neuron getting no information about the environment, while the case where $E(d(\nu_1, \nu_2)) = 1$ corresponds to the neuron getting maximum information. The theorem predicts, in the limit, one of these extremes depending on how the connectivity ($\alpha_0$) compares to the entropy ($\beta_0$).

Theorem.

1. If $\alpha_0 > \beta_0$, then $\lim_{N \to \infty} E(d(\nu_1, \nu_2)) = 1$.

2. If $\alpha_0 < \beta_0$, then $\lim_{N \to \infty} E(d(\nu_1, \nu_2)) = 0$.

The proof is given in the appendix, but the idea is easy to illustrate informally. Suppose $h = 2^{K+10}$ (corresponding to part 2 of the theorem). For most environments $e \in \mathcal{E}$, the first $K$ bits of $x \in e$ go through all $2^K$ possible values approximately $2^{10}$ times each as $x$ goes through all $h$ possible values once. Therefore, the patterns seen by the neuron are drawn from the fixed ensemble of all binary vectors of length $K$ with essentially uniform probability distribution, i.e., $\nu$ is the same for most environments. This means that, statistically, the neuron will end up doing the same function regardless of the environment at hand.

What about the opposite case, where $h = 2^{K-10}$ (corresponding to part 1 of the theorem)? Now, with only $2^{K-10}$ patterns available from the environment, the first $K$ bits of $x$ can assume at most $2^{K-10}$ values out of the possible $2^K$ values a binary vector of length $K$ can assume in principle. Furthermore, which values can be assumed depends on the particular environment at hand, i.e., $\nu$ does depend on the environment. Therefore, although the neuron still does not have the global picture, the information it has says something about the environment.
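The two regimes can be explored with a small Monte Carlo experiment. The sketch below is an illustration under stated assumptions rather than part of the proof: patterns are encoded as integers whose leading $K$ bits are what the neuron sees, and the sizes ($N = 12$, the two choices of $K$ and $h$), the number of trials, the random seed, and all function names are arbitrary choices made for this example.

```python
import random


def projection_frequencies(env, n_bits, k_bits):
    """nu(a): fraction of patterns in env whose first k_bits equal a."""
    counts = {}
    for x in env:
        a = x >> (n_bits - k_bits)        # keep the K leading bits of the N-bit pattern
        counts[a] = counts.get(a, 0) + 1
    return {a: c / len(env) for a, c in counts.items()}


def distance(nu1, nu2):
    """d(nu1, nu2) = 1/2 * sum_a |nu1(a) - nu2(a)|."""
    keys = set(nu1) | set(nu2)
    return 0.5 * sum(abs(nu1.get(a, 0.0) - nu2.get(a, 0.0)) for a in keys)


def mean_distance(n_bits, k_bits, h, trials, rng):
    """Average d over independently drawn pairs of environments of size h."""
    total = 0.0
    for _ in range(trials):
        e1 = rng.sample(range(2 ** n_bits), h)   # uniform h-subset of {0,1}^N
        e2 = rng.sample(range(2 ** n_bits), h)
        total += distance(projection_frequencies(e1, n_bits, k_bits),
                          projection_frequencies(e2, n_bits, k_bits))
    return total / trials


rng = random.Random(0)
# N = 12 throughout; the sizes are arbitrary small values chosen for speed.
d_high_connectivity = mean_distance(12, 10, 2 ** 4, trials=20, rng=rng)   # K > log2 h
d_low_connectivity = mean_distance(12, 2, 2 ** 10, trials=20, rng=rng)    # K < log2 h
```

With $K$ well above $\log_2 h$ the average distance should come out close to 1, and with $K$ well below $\log_2 h$ close to 0, in line with the two parts of the theorem, although at such small $N$ the limits are only approached.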
In this appendix we prove the main theorem. We start by discussing some basic properties of the ensemble of environments $\mathcal{E}$. Since the probability distribution on $\mathcal{E}$ is uniform and since $|\mathcal{E}| = \binom{2^N}{h}$, we have

$$\Pr(e) = \binom{2^N}{h}^{-1},$$

which is equivalent to generating $e$ by choosing $h$ elements $x \in \{0,1\}^N$ with uniform probability (without replacement). It follows that

$$\Pr(x \in e) = \frac{h}{2^N},$$

while for $x_1 \ne x_2$,

$$\Pr(x_1 \in e,\ x_2 \in e) = \frac{h}{2^N} \times \frac{h-1}{2^N - 1},$$

and so on.

The functions $n$ and $\nu$ are defined on $K$-bit vectors. The statistics of $n(a)$ (a random variable for fixed $a$) are independent of $a$:

$$\Pr(n(a_1) = m) = \Pr(n(a_2) = m),$$

which follows from the symmetry with respect to each bit of $a$. The same holds for the statistics of $\nu(a)$. The expected value $E(n(a)) = h2^{-K}$ ($h$ objects going into $2^K$ cells), hence $E(\nu(a)) = 2^{-K}$. We now restate and prove the theorem.

Theorem.

1. If $\alpha_0 > \beta_0$, then $\lim_{N \to \infty} E(d(\nu_1, \nu_2)) = 1$.

2. If $\alpha_0 < \beta_0$, then $\lim_{N \to \infty} E(d(\nu_1, \nu_2)) = 0$.

Proof.

We expand $E(d(\nu_1, \nu_2))$ as follows:

$$\begin{aligned}
E(d(\nu_1, \nu_2)) &= E\Big( \frac{1}{2} \sum_{a \in \{0,1\}^K} |\nu_1(a) - \nu_2(a)| \Big) \\
&= \frac{1}{2h} \sum_{a \in \{0,1\}^K} E(|n_1(a) - n_2(a)|) \\
&= \frac{2^K}{2h} E(|n_1 - n_2|)
\end{aligned}$$

where $n_1$ and $n_2$ denote $n_1(0 \cdots 0)$ and $n_2(0 \cdots 0)$, respectively, and the last step follows from the fact that the statistics of $n_1(a)$ and $n_2(a)$ are independent of $a$. Therefore, to prove the theorem, we evaluate $E(|n_1 - n_2|)$ for large $N$.

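1. Assume $\alpha_0 > \beta_0$. Let $n$ denote $n(0 \cdots 0)$, and consider $\Pr(n = 0)$. For $n$ to be zero, all $2^{N-K}$ strings $x$ of $N$ bits starting with $K$ 0's must not be in the environment $e$. Hence

$$\Pr(n = 0) = \Big(1 - \frac{h}{2^N}\Big)\Big(1 - \frac{h}{2^N - 1}\Big) \cdots \Big(1 - \frac{h}{2^N - 2^{N-K} + 1}\Big)$$

where the first term is the probability that $0 \cdots 00 \notin e$, the second term is the probability that $0 \cdots 01 \notin e$ given that $0 \cdots 00 \notin e$, and so on. It follows that

$$\begin{aligned}
\Pr(n = 0) &> \Big(1 - \frac{h}{2^N - 2^{N-K}}\Big)^{2^{N-K}} \\
&> (1 - 2h2^{-N})^{2^{N-K}} \\
&> 1 - 2h2^{-N}2^{N-K} \\
&= 1 - 2h2^{-K}.
\end{aligned}$$

Hence, $\Pr(n_1 = 0) = \Pr(n_2 = 0) = \Pr(n = 0) > 1 - 2h2^{-K}$. However, $E(n_1) = E(n_2) = h2^{-K}$. Therefore,

$$\begin{aligned}
E(|n_1 - n_2|) &= \sum_{i=0}^{h} \sum_{j=0}^{h} \Pr(n_1 = i,\ n_2 = j)\,|i - j| \\
&= \sum_{i=0}^{h} \sum_{j=0}^{h} \Pr(n_1 = i) \Pr(n_2 = j)\,|i - j| \\
&\ge \sum_{j=0}^{h} \Pr(n_1 = 0) \Pr(n_2 = j)\,j + \sum_{i=0}^{h} \Pr(n_1 = i) \Pr(n_2 = 0)\,i
\end{aligned}$$

which follows by throwing away all the terms where neither $i$ nor $j$ is zero (the term where both $i$ and $j$ are zero appears twice for convenience, but this term is zero anyway). Continuing,

$$\begin{aligned}
E(|n_1 - n_2|) &\ge \Pr(n_1 = 0)E(n_2) + \Pr(n_2 = 0)E(n_1) \\
&\ge 2(1 - 2h2^{-K})h2^{-K}.
\end{aligned}$$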

Substituting this estimate in the expression for $E(d(\nu_1, \nu_2))$, we get

$$\begin{aligned}
E(d(\nu_1, \nu_2)) &= \frac{2^K}{2h} E(|n_1 - n_2|) \\
&\ge \frac{2^K}{2h} \times 2(1 - 2h2^{-K})h2^{-K} \\
&= 1 - 2h2^{-K} \\
&= 1 - 2 \times 2^{(\beta - \alpha)N}.
\end{aligned}$$

Since $\alpha_0 > \beta_0$ by assumption, this lower bound goes to 1 as $N$ goes to infinity. Since 1 is also an upper bound for $d(\nu_1, \nu_2)$ (and hence an upper bound for the expected value $E(d(\nu_1, \nu_2))$), $\lim_{N \to \infty} E(d(\nu_1, \nu_2))$ must be 1.
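As a sanity check on part 1, the product formula for $\Pr(n = 0)$ and the estimate $1 - 2h2^{-K}$ can be evaluated exactly for small sizes. The sketch below uses exact rational arithmetic; the function names and the sizes $N = 12$, $K = 10$, $h = 16$ are assumptions chosen for illustration, and the closed form with binomial coefficients is the standard count of $h$-subsets that avoid the $2^{N-K}$ excluded strings, included here only as a cross-check.

```python
from fractions import Fraction
from math import comb


def prob_no_match(n_bits, k_bits, h):
    """Pr(n = 0): none of the 2^(N-K) strings starting with K zeros lies in e."""
    total = 2 ** n_bits
    p = Fraction(1)
    for j in range(2 ** (n_bits - k_bits)):
        p *= 1 - Fraction(h, total - j)       # factor (1 - h / (2^N - j))
    return p


def lower_bound(k_bits, h):
    """The estimate 1 - 2 h 2^(-K) used in part 1 of the proof."""
    return 1 - Fraction(2 * h, 2 ** k_bits)


N, K, h = 12, 10, 16           # illustrative sizes with K > log2 h
exact = prob_no_match(N, K, h)
bound = lower_bound(K, h)
# Cross-check: choose all h patterns from the strings that are not excluded.
closed_form = Fraction(comb(2 ** N - 2 ** (N - K), h), comb(2 ** N, h))
```

For these sizes the same quantity `bound` is also the lower bound on $E(d(\nu_1, \nu_2))$ obtained at the end of part 1.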

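2. Assume $\alpha_0 < \beta_0$. Consider

$$\begin{aligned}
E(|n_1 - n_2|) &= E\big( |(n_1 - h2^{-K}) - (n_2 - h2^{-K})| \big) \\
&\le E\big( |n_1 - h2^{-K}| + |n_2 - h2^{-K}| \big) \\
&= E(|n_1 - h2^{-K}|) + E(|n_2 - h2^{-K}|) \\
&= 2E(|n - h2^{-K}|).
\end{aligned}$$

To evaluate $E(|n - h2^{-K}|)$, we estimate the variance of $n$ and use the fact that $E(|n - h2^{-K}|) \le \sqrt{\operatorname{var}(n)}$ (recall that $h2^{-K} = E(n)$). Since $\operatorname{var}(n) = E(n^2) - (E(n))^2$, we need an estimate for $E(n^2)$. We write $n = \sum_{a \in \{0,1\}^{N-K}} \delta_a$, where

$$\delta_a = \begin{cases} 1, & \text{if } 0 \cdots 0a \in e; \\ 0, & \text{otherwise.} \end{cases}$$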

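In this notation, $E(n^2)$ can be written as

$$\begin{aligned}
E(n^2) &= E\Big( \sum_{a \in \{0,1\}^{N-K}} \sum_{b \in \{0,1\}^{N-K}} \delta_a \delta_b \Big) \\
&= \sum_{a \in \{0,1\}^{N-K}} \sum_{b \in \{0,1\}^{N-K}} E(\delta_a \delta_b).
\end{aligned}$$

For the 'diagonal' terms ($a = b$),

$$E(\delta_a \delta_a) = \Pr(\delta_a = 1) = h2^{-N}.$$

There are $2^{N-K}$ such diagonal terms, hence a total contribution of $2^{N-K} \times h2^{-N} = h2^{-K}$ to the sum. For the 'off-diagonal' terms ($a \ne b$),

$$\begin{aligned}
E(\delta_a \delta_b) &= \Pr(\delta_a = 1,\ \delta_b = 1) \\
&= \Pr(\delta_a = 1) \Pr(\delta_b = 1 \mid \delta_a = 1) \\
&= \frac{h}{2^N} \times \frac{h-1}{2^N - 1}.
\end{aligned}$$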
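There are $2^{N-K}(2^{N-K} - 1)$ such off-diagonal terms, hence a total contribution of

$$2^{N-K}(2^{N-K} - 1) \times \frac{h}{2^N} \cdot \frac{h-1}{2^N - 1} < (h2^{-K})^2 \frac{2^N}{2^N - 1}$$

to the sum. Putting the contributions from the diagonal and off-diagonal terms together, we get

$$\begin{aligned}
E(n^2) &< h2^{-K} + (h2^{-K})^2 \frac{2^N}{2^N - 1}, \\
\operatorname{var}(n) &= E(n^2) - (E(n))^2 \\
&< h2^{-K} + (h2^{-K})^2 \frac{2^N}{2^N - 1} - (h2^{-K})^2 \\
&= h2^{-K} + (h2^{-K})^2 \frac{1}{2^N - 1} \\
&< 2h2^{-K}.
\end{aligned}$$

The last step follows since $h2^{-K}$ is much smaller than $2^N - 1$. Therefore, $E(|n - h2^{-K}|) \le \sqrt{\operatorname{var}(n)} < (2h2^{-K})^{1/2}$. Substituting this estimate in the expression for $E(d(\nu_1, \nu_2))$, we get

$$\begin{aligned}
E(d(\nu_1, \nu_2)) &= \frac{2^K}{2h} E(|n_1 - n_2|) \\
&\le \frac{2^K}{2h} \times 2E(|n - h2^{-K}|) \\
&< \frac{2^K}{2h} \times 2 \times (2h2^{-K})^{1/2} \\
&= \Big( \frac{2 \times 2^K}{h} \Big)^{1/2} \\
&= \sqrt{2} \times 2^{\frac{1}{2}(\alpha - \beta)N}.
\end{aligned}$$

Since $\alpha_0 < \beta_0$ by assumption, this upper bound goes to 0 as $N$ goes to infinity. Since 0 is also a lower bound for $d(\nu_1, \nu_2)$ (and hence a lower bound for the expected value $E(d(\nu_1, \nu_2))$), $\lim_{N \to \infty} E(d(\nu_1, \nu_2))$ must be 0.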
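The variance estimate in part 2 can likewise be checked exactly for small sizes. The sketch below assembles $E(n^2)$ from the diagonal and off-diagonal contributions derived above and compares the resulting variance with the bound $2h2^{-K}$; the function names and the sizes $N = 12$, $K = 2$, $h = 2^{10}$ are assumptions made for illustration only.

```python
from fractions import Fraction


def second_moment(n_bits, k_bits, h):
    """E(n^2) assembled from the diagonal and off-diagonal terms of the proof."""
    total = 2 ** n_bits
    cells = 2 ** (n_bits - k_bits)            # number of indicator variables delta_a
    diagonal = cells * Fraction(h, total)
    off_diagonal = cells * (cells - 1) * Fraction(h, total) * Fraction(h - 1, total - 1)
    return diagonal + off_diagonal


def variance(n_bits, k_bits, h):
    """var(n) = E(n^2) - (E(n))^2 with E(n) = h 2^(-K)."""
    mean = Fraction(h, 2 ** k_bits)
    return second_moment(n_bits, k_bits, h) - mean ** 2


N, K, h = 12, 2, 2 ** 10       # illustrative sizes with K < log2 h
var_n = variance(N, K, h)
var_bound = Fraction(2 * h, 2 ** K)           # the estimate 2 h 2^(-K)
d_bound = (2 * 2 ** K / h) ** 0.5             # resulting upper bound on E(d)
```

For these sizes the exact variance is roughly 144 against a bound of 512, so the estimate in the proof is loose by a small constant factor, which does not affect the limit.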
4 Adaptive Optical Networks Using Photorefractive Crystals

4.1  Introduction 

Learning is the most distinctive feature of a neural computer and in many respects it is this aspect that gives neural computation an advantage over alternative computational strategies. A neural computer is trained to produce the appropriate response to a class of inputs by being presented with a sufficient number of examples during the learning phase. The presentation of these examples causes the strength of the connections between neurons that comprise the network to be modified according to the specifics of the learning algorithm. A successful learning procedure will result in a trained network that responds correctly when it is presented with the examples it has seen previously and also other inputs that are in some sense similar to the known patterns. When we consider a physical realization of a neural network model, we have two options in incorporating learning capability. The first is to build a network with fixed but initially programmable connections. An auxiliary, conventional computer can then be used to "learn" the correct values of the connection strengths and once learning has been completed the network can be programmed by the computer. While this approach may be reasonable for some applications, a system with continuously modifiable connections presents a much more powerful alternative.

In this paper we consider the optical implementation of learning networks
using volume holographic interconnections in photorefractive crystals.
The use of volume holograms permits the storage of a very large
number of interconnections per unit volume [1,2,3,4], whereas the use of
photorefractive crystals permits the dynamic modification of these connections,
thus allowing the implementation of learning algorithms [5,6,7,8,9].
We first briefly review the major types of learning algorithms that are being
used  in  neural  network  models.  We  then  estimate  the  maximum  number 
of  holographic  gratings  that  can  simultaneously  exist  in  a  photorefractive 
crystal.  Since  in  an  optical  implementation  each  grating  corresponds  to  a 
separate  interconnection  between  two  neurons,  this  estimate  gives  us  the 
density of connections that are achievable with volume holograms. The
next topic that we address is how the modulation depth of each grating (or
equivalently  the  strength  of  each  connection)  can  be  controlled  through  the 
implementation  of  learning  algorithms.  Two  related  issues  are  investigated: 
the  optical  architectures  which  implement  different  learning  algorithms  and 
the  reconciliation  of  physical  mechanisms  that  are  involved  in  the  recording 
of  holograms  in  photorefractive  crystals  with  the  dynamics  of  the  learning 
procedures  in  neural  networks. 

4.2  Learning  algorithms 

For the purposes of this discussion it is convenient to separate the wide
range of learning algorithms that have been discussed in the literature into
three categories: prescribed learning, error driven learning and self organization.
We will draw the distinction among these with the aid of Fig. 1,
where a general network is drawn with the vector $x(k)$ as its input and $y(k)$
as its output at the $k$th iteration (or time interval). The vector $z(k)$ is used
to represent the activity of the internal units and $w_{ij}(k)$ is the connection
strength between the $i$th and the $j$th unit. Let $x^{(m)}$, $m = 1 \ldots M$, be a set
of specified input vectors and let $y^{(m)}$ be the responses which the network
must produce for each of these input vectors.

A prescribed learning algorithm calculates the strength of each weight
simply as a function of the vectors $x^{(m)}$ and $y^{(m)}$:

$w_{ij} = f_{ij}(x^{(m)}, y^{(m)}), \quad m = 1 \ldots M$  (1)

This  type  of  procedure  is  relatively  simple  (“easy  learning”).  It  is  perhaps 
the most sensible approach in a single layer network. The widely used outer
product algorithm [10,11] is an example of this type of learning algorithm,
as  are  some  schemes  which  utilize  the  pseudoinverse  [10,12,13].  Despite 
its  simplicity,  prescribed  learning  is  limited  in  several  important  respects. 
First,  while  prescribed  learning  is  well  understood  for  single  layer  systems, 
the  existing  algorithms  for  two  layers  are  largely  localized  representations; 
each  input  activates  a  single  internal  neuron  [14,15,16].  Moreover,  the 
entire learning procedure usually has to be completed a priori. This last
limitation is not encountered in the simplest form of prescribed learning,
the sum of outer products:

$w_{ij} = \sum_{m=1}^{M} y_i^{(m)} x_j^{(m)}$  (2)

In this case new memories may be programmed by simply adding the outer
products of new samples to the weight matrix. Note that once the interconnection
matrix has been determined by a prescribed learning algorithm, it
may be expressed in the form of a sum of at most $N$ outer products, where
$N$ is the total number of neurons in each layer. Since volume holograms
record interconnection matrices represented by sums of outer products
in  a  very  natural  way,  matrices  which  can  be  expressed  in  this  form  are 
particularly  simple  to  implement  in  optics  [17,18,19,20]. 
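
The incremental character of outer product learning can be illustrated numerically. The following is a minimal Python sketch rather than a model of the optics: it assumes bipolar (+1/-1) patterns, a hard threshold at zero, and the index convention $w_{ij} = \sum_m y_i^{(m)} x_j^{(m)}$. All function names are hypothetical.

```python
def add_association(w, x, y):
    """Add the outer product of y and x to the weight matrix w, in place."""
    for i, yi in enumerate(y):
        for j, xj in enumerate(x):
            w[i][j] += yi * xj


def outer_product_weights(xs, ys):
    """Build w as a sum of outer products over all (x, y) pairs."""
    n_out, n_in = len(ys[0]), len(xs[0])
    w = [[0.0] * n_in for _ in range(n_out)]
    for x, y in zip(xs, ys):
        add_association(w, x, y)
    return w


def recall(w, x):
    """Single layer recall with a hard threshold at zero (assumed here)."""
    return [1 if sum(wij * xj for wij, xj in zip(row, x)) >= 0 else -1
            for row in w]


# Two mutually orthogonal input patterns and their desired responses.
xs = [[1, 1, 1, 1], [1, -1, 1, -1]]
ys = [[1, -1], [-1, 1]]
w = outer_product_weights(xs, ys)

# A new memory is programmed by adding one more outer product;
# nothing that was stored earlier has to be recomputed.
x_new, y_new = [1, 1, -1, -1], [1, 1]
add_association(w, x_new, y_new)
```

Because the three input patterns in this example are mutually orthogonal, each stored response is recalled exactly; with correlated patterns the recall would contain crosstalk.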

Error driven learning is distinguished by the fact that the output of the
system, $y(k)$, is monitored and compared to the desired response $y^{(m)}$. An
incremental change is then made to the interconnection weights to reduce
the error:

$\Delta w_{ij}(k) = f_{ij}[x^{(m)}, w_{rs}(k), y^{(m)}]$  (3)

The change $\Delta w_{ij}$ is calculated from the vectors $x^{(m)}$ and $y^{(m)}$ and the current
setting of the weight matrix $w_{rs}(k)$ (from which the state of the entire
network can be calculated). The perceptron [21] and adaline [22] algorithms
are  examples  of  error  driven  learning  for  single  layer  networks.  Interest  in 
such  learning  algorithms  has  been  renewed  recently  by  the  development  of 
procedures suitable for multi-layered networks [23,24,25]. Error driven algorithms
(“hard learning”) are more difficult to implement than prescribed
learning  since  they  require  a  large  number  of  iterations  before  errors  can 
be  reduced  to  sufficiently  low  levels.  In  multilayered  systems,  however, 
this  type  of  learning  can  provide  an  effective  mechanism  for  matching  the 
available  resources  (connections  and  neurons)  to  the  requirements  of  the 
problem.  In  optical  realizations  error  driven  algorithms  are  more  difficult 
to  implement  than  prescribed  approaches  due  to  the  need  for  dynamically 
modifiable  interconnections  and  the  incorporation  of  an  optical  system  that 
monitors  the  performance  and  causes  the  necessary  changes  in  the  weights 
[26]. While this problem could be avoided by performing learning off line
in computer simulations and recording the optimized interconnection matrix
as in prescribed learning, this approach has the disadvantage that once
again the matrix is fixed a priori, thus preventing the network from being
adaptive. In subsequent sections we will consider a relatively simple form
of Eqn. (3) in which $\Delta w_{ij}(k)$ depends only on locally available information,
i.e. $z_i$ in one layer and $z_j$ in an adjacent layer:

$\Delta w_{ij}(k) = f_{ij}[z_i(k), z_j(k), y^{(m)}, x^{(m)}]$  (4)

The  perceptron  and  the  backwards  error  propagation  algorithms  both  fall  in 
this subcategory if we allow the neuronal activity $z_i$ to include error signals,
i.e.  if  each  neuron  has  distinct  signal  and  error  outputs  which  are  separated 
temporally  or  spatially.  An  example  of  such  a  neuron  implemented  in  optics 
is  given  below  in  conjunction  with  an  optical  back  error  propagation  system. 
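
As an illustration of an error driven rule that uses only locally available quantities, the sketch below trains a single threshold unit with the classical perceptron update, in which each weight changes by the product of the error at the output and the activity at the input. It is a software illustration only; the learning rate, the bias handling and the function names are assumptions of this example.

```python
def step(v):
    """Hard threshold used as the neuron nonlinearity in this sketch."""
    return 1 if v >= 0 else 0


def predict(w, x):
    """Output of a single threshold unit; the last weight acts as a bias."""
    xb = list(x) + [1.0]
    return step(sum(wi * xi for wi, xi in zip(w, xb)))


def train_perceptron(samples, n_inputs, rate=1.0, max_epochs=100):
    """Perceptron rule: delta w_j = rate * (target - output) * x_j."""
    w = [0.0] * (n_inputs + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, target in samples:
            err = target - predict(w, x)
            if err != 0:
                errors += 1
                xb = list(x) + [1.0]
                # The update needs only the error at the output unit
                # and the activity on the input line it multiplies.
                for j in range(len(w)):
                    w[j] += rate * err * xb[j]
        if errors == 0:
            break
    return w


# Linearly separable example: logical AND of two binary inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w_and = train_perceptron(and_samples, n_inputs=2)
```

Many passes through the training set are typically needed before the errors vanish, which is the sense in which error driven learning is "hard" compared with a single prescribed computation.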

In  the  case  of  self  organizing  learning  algorithms  we  require  not  that 
the  specified  inputs  produce  a  particular  response  but  rather  that  they 
satisfy  a  general  restriction,  often  imposed  by  the  structure  of  the  network 
itself. Since there is no a priori expected response, the learning rule for self
organizing systems is simply

$\Delta w_{ij}(k) = f_{ij}[x^{(m)}, w_{rs}(k)]$  (5)

This  type  of  learning  procedure  can  be  useful,  for  instance,  at  intermediate 
levels  of  a  network  where  the  purpose  is  not  to  elicit  an  external  response 
but rather to generate appropriate internal representations of the information
that is presented as input to the network. There is a broad range of self
organizing  algorithms,  the  simplest  of  which  is  probably  lateral  inhibition 
to  enforce  grandmother  cell  representations  [10,27].  The  objective  of  the 
learning  procedure  is  to  have  each  distinct  pattern  in  an  input  set  of  neurons 
activate  a  single  neuron  in  a  second  set.  In  the  architecture  shown  in  Fig.  2 
this  is  accomplished  via  inhibitory  connections  between  the  neurons  in  the 
second  set.  Once  a  particular  neuron  in  the  second  layer  is  partially  turned 
on  for  a  specific  pattern  it  prevents  the  connections  to  the  other  neurons 
in  the  second  set  from  assuming  values  that  will  result  in  activity  at  more 
than  one  neuron.  The  details  of  the  dynamics  of  such  procedures  can  be 
quite  complex  (e.g.  [28]),  as  can  corresponding  optical  implementations. 
An  advantageous  feature  of  optics  in  connection  with  self  organization  is 
that  global  training  signals,  such  as  fixed  lateral  inhibition  between  all  the 
neurons in a given layer, can easily be broadcast with optical beams.


4.3  Interconnection  capabilities  of  volume  holograms 

The  basic  architecture  for  optical  implementation  of  a  neural  computer  is 
shown in Fig. 3. The figure presents a single stage of what may be a multilayered
system. The nonlinear processing elements (i.e. the “neurons”)
are arranged in planes. We have included a “training plane” for reasons
which will become clear below. Neurons in one plane are interconnected
with  the  neurons  in  the  same  or  other  planes  via  the  third  dimension.  The 
strength  of  the  interconnections  is  determined  by  the  information  which  is 
holographically stored in light sensitive media placed in the space separating
the neural planes. Volume, rather than “thin”, holograms are specified
in Fig. 3 due to the much greater storage capacity of volume holograms
and the availability of excellent real time volume media. Photorefractive
crystals  are  particularly  attractive  as  holographic  media  in  this  application 
because  it  is  possible  to  record  information  in  these  crystals  in  real  time 
at  very  high  density  without  degrading  the  photorefractive  sensitivity.  In 
this  section  we  discuss  the  factors  that  determine  the  maximum  number  of 
connections  that  can  be  specified  by  a  photorefractive  crystal  with  a  given 
set  of  physical  characteristics.  There  are  three  distinct  factors  that  need 
to  be  considered:  geometric  limitations  arising  from  the  basic  principles  of 
volume holography, limitations arising from the physics of photorefractive
recording,  and  limitations  due  to  the  learning  algorithms. 

The Fourier lenses in Fig. 3 transform the spatial position of each neuron
into a spatial frequency associated with light emitted by or incident on that
neuron. An interconnection between the $i$th neuron in the input plane and
the $j$th neuron in the output plane is formed by interfering light emitted
by the input neuron with light emitted by the $j$th neuron in the training
plane. The image of the $j$th training neuron lies at the position of the $j$th
neuron in the output plane. The interference of the training signal and the
input creates a grating in the recording medium of the form

$\Delta\chi_{ij} = A_i A_j^{*} e^{j K_{ij} \cdot r}$  (6)

where $A_i$ and $A_j$ are the amplitudes of the fields emitted by the $i$th and
$j$th neurons, respectively. $K_{ij}$ is equal to $k_i - k_j$, where $k_i$ and $k_j$ are the
spatial frequencies at which the corresponding amplitudes propagate in the
volume medium. This grating diffracts an input beam at spatial frequency
$k_\alpha$ into an output beam at spatial frequency $k_\beta$ if these two beams satisfy
the Bragg constraint that

$k_\alpha - k_\beta = K_{ij}$  (7)

This constraint is obviously satisfied if $k_\alpha = k_i$ and $k_\beta = k_j$. In general this
solution is not unique. However, Psaltis et al. [2,3] have shown that by
placing the neurons on the input and output planes on appropriate fractal
grids of dimension 3/2 it is possible to ensure that only the $i$th input neuron
and the $j$th output neuron may be coupled by a grating with wavevector
$K_{ij}$. In this case, recording a hologram between light from the $i$th input neuron
and the $j$th training neuron increases the connection strength between
the $i$th input and the $j$th output without directly affecting the connections
between other neurons. If instead of one neuron, patterns of neurons are
active on the fractal grids of the input and training planes then the hologram
recorded in the volume, i.e. Eqn. (6) summed over all active pairs
of neurons, is the outer product of the pattern on the input plane and the
pattern on the training plane. Exposing the hologram with a series of $M$
patterns yields the sum of outer products described by Eqn. (2). Note that
the architecture shown in Fig. 3 is similar to a joint Fourier transform correlator.
The use of volume, rather than thin, holograms and fractal grids
destroys the shift invariance of the correlator, making this architecture a
totally shift variant arbitrarily interconnectable system.

A basic geometrical limitation on the density of interconnections achievable
through volume holograms is due to the finite volume, $V$, of any real
crystal. The refractive index $n(r)$ of such a crystal under periodic boundary
conditions may be represented in the form

$n(r) = \sum_{\nu} n_{\nu} e^{j k_{\nu} \cdot r}$  (8)

$k_{\nu} = \nu_x \frac{2\pi}{L_x} \hat{x} + \nu_y \frac{2\pi}{L_y} \hat{y} + \nu_z \frac{2\pi}{L_z} \hat{z}, \quad \nu_i = 0, \pm 1, \pm 2, \ldots$  (9)

where $n_\nu$ is the amplitude of the Fourier component at spatial frequency
$k_\nu$ and $L_i$ is the length of the crystal in the $i$ direction. Since the maximum
spatial frequency which may be Bragg matched to diffract light at wavelength
$\lambda$ is $2k_0$, where $k_0 = 2\pi/\lambda$, the sum in Eqn. (8) is finite in holographic
applications. The number of spatial frequencies in the sum is $S \approx V/\lambda^3$.
Psaltis et al. [2,3] demonstrated that $S$ is sufficient to fully and independently
interconnect neural planes which are limited to fractal dimension 3/2.
Thus in this previous work the issue of these geometric limitations was fully
resolved under the condition that processing nodes in the input and output
planes must be appropriately arranged on fractal grids. Other geometric
limitations arise due to finite numerical apertures and the physics of holographic
recording mechanisms. These factors may be shown to contribute
a scaling factor to $S$ which is independent of $V$ and $\lambda$. For $V = 1$ cm³ and
$\lambda = 1$ μm, $V/\lambda^3$ is equal to $10^{12}$. In interconnecting neurons arranged on fractal
planes, even though the recording geometry typically allows access to only
1% of grating wavevector space, we still may achieve $10^{10}$ interconnections
per cm³.
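
The capacity figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes the order-of-magnitude estimate that the number of gratings is the crystal volume divided by the cube of the wavelength, and that 1% of grating wavevector space is accessible; the helper names are illustrative only.

```python
import math


def grating_count(volume_m3, wavelength_m):
    """Order-of-magnitude number of resolvable gratings, V / lambda^3."""
    return volume_m3 / wavelength_m ** 3


def usable_connections(volume_m3, wavelength_m, accessible_fraction=0.01):
    """Gratings that remain when only a fraction of wavevector space is used."""
    return accessible_fraction * grating_count(volume_m3, wavelength_m)


def neurons_per_plane(connections):
    """Assumption of this sketch: two planes of equal size, fully
    interconnected, so the plane size is the square root of the count."""
    return math.sqrt(connections)


V = 1e-6       # 1 cm^3 expressed in m^3
LAMBDA = 1e-6  # 1 micrometre expressed in m

S = grating_count(V, LAMBDA)               # about 1e12
N_CONN = usable_connections(V, LAMBDA)     # about 1e10
N_NEURONS = neurons_per_plane(N_CONN)      # about 1e5
```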

We now address the question of whether this large number of gratings
can be supported in a photorefractive crystal, i.e. do photorefractive
crystals have the capability of simultaneously storing $10^{10}$ gratings each
with sufficient diffraction efficiency? In this paper we answer this question
based on simple arguments in the context of a neural architecture. The
conclusions we reach are the same as those we arrive at through a more
thorough examination of the problem. Photorefractive holograms are produced
in electrooptic crystals via the modulation of the index of refraction
by the space charge field created by an optically driven inhomogeneous
charge distribution. A neural network architecture implemented in volume
holograms performs a transformation of the form

$E_{i\,in} e^{j\phi_i} e^{j k_i \cdot r} + c.c. = \sum_j \eta_{ij} e^{j\psi_{ij}} E_{j\,out} e^{j\phi_j} e^{j k_i \cdot r} + c.c.$  (10)

between the field amplitude, $E_{j\,out} e^{j k_j \cdot r}$, of the $j$th neuron and the field
amplitude, $E_{i\,in} e^{j k_i \cdot r}$, incident on the input of the $i$th neuron. c.c. denotes
the complex conjugate of the preceding term. $\phi_j$ and $\phi_i$ are the phases of
the field amplitudes corresponding to the $j$th and $i$th neurons. $\psi_{ij}$ is the
phase of the grating which connects the $i$th and $j$th neurons. The diffraction
efficiencies $\eta_{ij}$ are proportional to the component of the space charge density
in the crystal at spatial frequency $K_{ij} = k_i - k_j$ [29]. The total space charge
density due to $N$ stored gratings is constrained at every point in the crystal
to be less than the acceptor trap density. This implies that

$\eta_0 \geq \sum_i \sum_j \eta_{ij} \cos(K_{ij} \cdot r + \psi_{ij})$  (11)

where $\eta_0$ is the maximum diffraction efficiency when only one grating is
recorded. If $\psi_{ij}$ is an independent uniformly distributed random variable
on $(-\pi, \pi)$, then with high probability the right side of Eqn. (11) will not
exceed a few times its standard deviation, $\sqrt{N/2}\,\bar{\eta}$, where $\bar{\eta}$ is the rms value
of $\eta_{ij}$. This fact allows us to find a simple limit for $\bar{\eta}$ given by

$\bar{\eta} \lesssim \frac{\eta_0}{\sqrt{N}}$  (12)


Note  that  although  we  have  assumed  that  the  sums  in  Eqn.  (11)  are  over  a 
set  of  incoherent  sinusoids,  this  does  not  imply  that  the  sum  in  Eqn.  (10) 
is incoherent. To illustrate this point imagine that $\psi_{ij} = \phi_i - \phi_j$. In
this case the terms in Eqn. (10) add coherently. However if $\phi_i$ and $\phi_j$ are
independent random variables the sums in Eqn. (11) still add incoherently.
Thus  a  random  phase  term  in  the  transmittance  at  each  neuron  causes 
the  charge  densities  stored  in  the  crystal  to  add  incoherently  but  does  not 
necessarily  destroy  the  coherence  of  the  optical  system. 
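
The statistical argument above, that a sum of many sinusoids with independent random phases grows only as the square root of their number, can be checked with a small Monte Carlo experiment. The sketch below is an illustration under simplifying assumptions: every grating is given the same amplitude, the total is sampled at a single point in the crystal so that each term reduces to the cosine of a uniformly distributed phase, and the function names are hypothetical.

```python
import math
import random


def sample_total(n_gratings, eta, rng):
    """One realization of the sum of n equal-amplitude random-phase cosines."""
    return sum(eta * math.cos(rng.uniform(-math.pi, math.pi))
               for _ in range(n_gratings))


def estimate_std(n_gratings, eta, trials, seed=0):
    """Sample standard deviation of the total over many phase realizations."""
    rng = random.Random(seed)
    samples = [sample_total(n_gratings, eta, rng) for _ in range(trials)]
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / (trials - 1)
    return math.sqrt(var)


def predicted_std(n_gratings, eta):
    """Each cosine term has variance eta^2 / 2, and the variances add."""
    return math.sqrt(n_gratings / 2.0) * eta


N_GRATINGS = 400
measured = estimate_std(N_GRATINGS, eta=1.0, trials=2000)
expected = predicted_std(N_GRATINGS, eta=1.0)
```

If the phases were instead all equal, the same sum would reach the full value of the number of gratings times the amplitude, which is why random phases allow far more gratings to share the available space charge.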

The holographic transformation described above can be used to implement
neural architectures which map an activity pattern described by the
outputs $\{x_j\}$ of the neurons on one neural plane to the outputs $\{y_i\}$ of
the next neural plane. In a coherent optical system $x_j$ is represented by
$E_{j\,out} e^{j\phi_j}$ and $w_{ij}$ is represented by $\eta_{ij} e^{j\psi_{ij}}$. Since most simple optical nonlinearities
are based on absorption the transformation between $\{x_j\}$ and
$\{y_i\}$ typically takes the form

$y_i = f\left(\left|\sum_j w_{ij} x_j\right|^2\right)$  (13)

where $f$ is a thresholding function implemented in the neural plane. This
functional form might be avoided using interferometric detection. In an
incoherent optical system $x_j$ is represented by $|E_{j\,out}|^2$ and $w_{ij}$ is represented
by $\eta_{ij}^2$. The transformation between $\{x_j\}$ and $\{y_i\}$ takes the form

$y_i = f\left(\sum_j w_{ij} x_j\right)$  (14)

In either case the function $f$ must provide sufficient gain, $G$, to regenerate
the signal power of the system after each layer. If we assume that each layer
contains $\sqrt{N}$ neurons then the relationship between the power incident on
a single neuron, $I_{in}$, and the power output by a single neuron, $I_{out}$, for a
coherent system with $\psi_{ij} = \phi_i - \phi_j$ is

$I_{in} = \left|\sum_{j=1}^{\sqrt{N}} \eta_{ij} e^{j\psi_{ij}} E_{j\,out} e^{j\phi_j}\right|^2 = N \bar{\eta}^2 I_{out} = \frac{I_{out}}{G_{coherent}}$  (15)

From Eqn. (12) we find

$G_{coherent} \gtrsim \frac{1}{\eta_0^2}$  (16)

For an incoherent system the corresponding relationship is

$I_{in} = \sum_j \eta_{ij}^2 |E_{j\,out}|^2 = \sqrt{N}\, \bar{\eta}^2 I_{out} = \frac{I_{out}}{G_{incoherent}}$  (17)

In this case Eqn. (12) yields

$G_{incoherent} \gtrsim \frac{\sqrt{N}}{\eta_0^2}$  (18)

Note that $1/G$ is the total diffraction efficiency of the volume hologram. Since
this must be less than 1 we know that $G > 1$. $\eta_0$ is determined by the
physical properties of the crystal, including the maximum charge density
available for grating storage, the thickness of the crystal, and its electrooptic
coefficients. For small $\Delta\epsilon$ we may estimate $\eta_0$ as $\eta_0 \sim \Delta\epsilon\, L / \lambda$,
where $L$ is the length of the crystal along the optical axis. For $\Delta\epsilon \approx 10^{-5}$,
$\lambda \approx 10^{-6}$ m and $L \approx 10^{-2}$ m, $\eta_0 = O(1)$. This means that in coherent systems
relatively little gain (i.e. $G = O(1)$) is needed to recall a large number
of sinusoidal gratings stored in a photorefractive crystal. Of course as we
attempt  to  store  arbitrarily  many  gratings  other  limits  arise,  but  at  least 
over  a  finite  bandwidth  of  the  electrooptic  response  of  the  crystal  coherent 
systems  should  have  no  difficulty  in  achieving  interconnection  densities  on 
the order of those implied by the geometrical constraints. Incoherent systems,
on the other hand, are unable to take advantage of holographic phase
matching and are thus less efficient [30]. In order to achieve $N = 10^{10}$, for
example, we must supply a gain of $G = 10^5$ in each neural plane. Examples
of how $G$ may be obtained optically include various combinations of image
intensifiers and spatial light modulators and multiwave mixing in nonlinear
materials. For example, an optically addressed spatial light modulator
such as the Hughes liquid crystal light valve is sensitive to approximately
10 μW/cm². If the read-out beam has an intensity of 1 W/cm² then we realize
a gain of $10^5$.
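
The gain figures in this discussion can be reproduced with simple arithmetic. The sketch below is illustrative only and rests on assumptions that are not stated in this form in the text: the rms grating efficiency is taken to scale as the single-grating efficiency divided by the square root of the number of gratings, each layer holds the square root of that number of neurons, and the single-grating efficiency is set to 1. The function names are hypothetical.

```python
import math


def coherent_gain(eta0):
    """Gain needed when the sqrt(N) diffracted fields add in phase
    (assumed scaling: G is about 1 / eta0^2, independent of N)."""
    return 1.0 / eta0 ** 2


def incoherent_gain(n_gratings, eta0):
    """Gain needed when only intensities add
    (assumed scaling: G is about sqrt(N) / eta0^2)."""
    return math.sqrt(n_gratings) / eta0 ** 2


def modulator_gain(readout_w_per_cm2, sensitivity_w_per_cm2):
    """Optical gain of a light valve: read-out intensity over write sensitivity."""
    return readout_w_per_cm2 / sensitivity_w_per_cm2


G_COH = coherent_gain(eta0=1.0)              # order unity
G_INC = incoherent_gain(1e10, eta0=1.0)      # about 1e5
G_LCLV = modulator_gain(1.0, 10e-6)          # 1 W/cm^2 over 10 microwatt/cm^2
```

Under these assumptions the gain that a light valve of the quoted sensitivity can supply is of the same order as the gain an incoherent system with $10^{10}$ gratings requires.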

The choice between coherent and incoherent implementations of optical
neural networks offers advantages and disadvantages on both sides. The incoherent
system is easier to implement but requires the large gain described
above and offers only unipolar activities and interconnection strengths. The
coherent implementation offers bipolar activities and interconnections but
requires rigid phase stability in the optical system over potentially very long
learning cycles. This stability is not difficult to achieve in prescribed learning
architectures, but may be more difficult to achieve in adaptive systems.
In addition, coherent systems generally square the signal incident on the
nonlinearity, unless interferometric detection is used. Interferometric detection
is difficult to implement in a complex optical system. Although the
incoherent system is straightforward to implement, this simplicity comes
at a cost of requiring biasing to compensate for unipolar values and external
gain. The coherent system is more elegant in that these additional
mechanisms are not necessary, but it is more sensitive to specific design
issues. One way of making coherent implementations more robust might
be to include adaptive optics, such as phase conjugate devices, to compensate
for phase instabilities. Although these devices might also be needed
in adaptive incoherent systems to detect the phase of a grating in order to
correctly update the associated interconnection, in the incoherent case it
is only necessary to detect the current state of the phase. In the coherent
case it is generally necessary to continuously track the phase.

4.4  Learning  architectures 

We now turn to the question of how we can specify the strength of each interconnection.
There is a nice compatibility between simple (multiplicative)
Hebbian learning and holography; the strength of the connection between
two  neurons  can  be  modified  by  recording  a  hologram  with  light  from  the 
two  neurons.  It  is  not  possible,  however,  to  record  multiple  holograms  in 
a  single  crystal  independently.  Thus  far  we  have  shown  that  the  space 
charge  in  a  photorefractive  crystal  may  be  arranged  to  achieve  a  very  large 
number  of  independent  interconnections.  The  task  that  remains  is  to  find 
a means of using optical beams from outside the crystal to correctly arrange
the three dimensional charge distribution. In particular, we must
find  means  to  address  the  full  three  dimensional  bandwidth  of  the  crystal 
from  two  dimensional  neural  planes.  In  order  to  successfully  implement 
learning  with  photorefractive  crystals  the  nonlinear  dynamics  that  govern 
the  multiple  exposure  of  holograms  in  a  photorefractive  medium  must  be 
reconciled with the nonlinear equations that describe the iterative procedures
of learning algorithms. It is extremely difficult to fully characterize
analytically the ability of an optical system to simulate a particular learning
algorithm. We will have to rely heavily on experiment in the search for
the  optimum  match  between  nonlinear  optics  and  learning  procedures  for 
neural  networks.  In  this  section  we  describe  learning  architectures  which 
are  relatively  simple  to  implement  experimentally  and  which  can  be  used 
to  evaluate  the  capability  of  photorefractive  crystals  to  store  information 
in  the  form  of  connectivity  patterns  in  a  neural  computer. 

The  first  learning  algorithm  we  consider  is  the  prescribed  sum  of  outer 
products  of  Eqn.  (2).  As  we  saw  in  the  previous  section,  a  sum  of  this  sort 
may  be  implemented  as  a  series  of  exposures  of  a  volume  hologram.  In 
a  photorefractive  crystal,  the  exposure  of  a  new  hologram  partially  erases 
previously recorded holograms. This places an upper limit on the maximum
number of holograms that can be recorded and thus the number of
associations, $M$, that can be stored in the crystal. The limit is found by
determining the minimum tolerable diffraction efficiency for each association
and solving for the number of exposures that will yield this efficiency.
Let $A_m$ be the amplitude of the $m$th hologram recorded. After a total of
$M$ exposures

$A_m = A_0 \left(1 - e^{-t_m/\tau_r}\right) \exp\left(-\sum_{m'=m+1}^{M} \frac{t_{m'}}{\tau_e}\right)$  (19)

where $A_0$ is the saturation amplitude of a hologram recorded in the photorefractive
crystal, $t_m$ is the exposure time for the $m$th hologram, and $\tau_r$ and $\tau_e$
are, respectively, the characteristic time constants for recording and erasing
a hologram in the crystal. We allow for the case that $\tau_e \neq \tau_r$ in light
of limited evidence that this may be the case in some crystals [31]. Ionic
conductivity is one mechanism leading to multiple time constants. We can
use several different criteria for selecting the exposure schedule $t_m$. For
instance if we require $A_m = A_{m+1}$ for all $m$ we obtain


$\left(1 - e^{-t_m/\tau_r}\right) e^{-t_{m+1}/\tau_e} = 1 - e^{-t_{m+1}/\tau_r}$  (20)

If $\tau_r = \tau_e$ then the solution to Eqn. (20) under the boundary condition
$t_1 \gg \tau_r$ is

$t_m = \tau_e \ln\left(\frac{m}{m-1}\right), \quad m > 1$  (21)

which yields

$A_m = A_M = \frac{A_0}{M}$  (22)

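For the case $\tau_r \neq \tau_e$ we define $\rho_m$ such that $t_m = \rho_m \tau_e$. Since, from Eqn. (19),
$\lim_{M \to \infty} A_m / A_0 = 0$, Eqn. (20) may be satisfied only if $\lim_{m \to \infty} t_m = 0$. Thus
for some $m_0 > 1$, $\rho_{m_0} \ll 1$ and $t_{m_0} \ll \tau_r$. Then, from Eqn. (20),

$t_{m_0+1} \approx t_{m_0} e^{-t_{m_0+1}/\tau_e}$

or

$\rho_{m_0+1} \approx \frac{\rho_{m_0}}{1 + \rho_{m_0}}$  (23)

By induction, for $m > m_0$

$\rho_m \approx \frac{\rho_{m_0}}{1 + (m - m_0)\rho_{m_0}}$  (24)

As $m$ grows large with $m_0$ fixed, Eqn. (24) can be shown to yield

$t_m \approx \frac{\tau_e}{m}$  (25)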
The value of $m$ for which this approximation holds increases with the ratio
$\tau_e/\tau_r$. In the case $\tau_r = \tau_e$ for example, $\tau_e/(3 t_3) = 0.82$ and $\tau_e/(10 t_{10}) = 0.95$. In any case,
for $M \gg m_0$ for some $m_0$ satisfying the constraints preceding Eqn. (23),

$A_m = A_M = A_0 \left(1 - e^{-t_M/\tau_r}\right)$  (26)

for all $m$. Solving for $M$ with $A_m \ll A_0$ we find a limit for $M$ given by

$M \approx \frac{\tau_e}{\tau_r} \frac{A_0}{A_m}$  (27)

This  result  agrees  well  with  what  we  might  expect  intuitively.  The  number 
of exposures allowed increases in proportion with the ratio $\tau_e/\tau_r$ (if we erase
slowly  we  can  store  more  holograms)  and  the  ratio  of  the  maximum  possible 
and  minimum  detectable  grating  amplitudes. 
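
The exposure schedule for equal recording and erasure time constants can be verified numerically. The sketch below assumes the usual model in which a hologram is written as one minus a decaying exponential of its exposure time and is then erased exponentially by every later exposure; it uses the logarithmic schedule $t_m = \tau \ln(m/(m-1))$ with a long first exposure. The function names and the choice of a first exposure of 50 time constants are assumptions of this example.

```python
import math


def exposure_schedule(n_holograms, tau, first_exposure_in_tau=50.0):
    """Exposure times for equal final amplitudes when tau_r = tau_e = tau.
    The first exposure is simply taken to be much longer than tau."""
    times = [first_exposure_in_tau * tau]
    times += [tau * math.log(m / (m - 1)) for m in range(2, n_holograms + 1)]
    return times


def final_amplitudes(times, tau_r, tau_e, a0=1.0):
    """Amplitude of each hologram after all exposures: write term times
    the erasure accumulated during every later exposure."""
    amps = []
    for m, t_m in enumerate(times):
        erasure = sum(times[m + 1:]) / tau_e
        amps.append(a0 * (1.0 - math.exp(-t_m / tau_r)) * math.exp(-erasure))
    return amps


def max_exposures(tau_r, tau_e, a0, a_min):
    """Approximate capacity: (tau_e / tau_r) * (A0 / A_min)."""
    return (tau_e / tau_r) * (a0 / a_min)


TAU = 1.0
schedule = exposure_schedule(10, TAU)
amps = final_amplitudes(schedule, tau_r=TAU, tau_e=TAU)

# How good is the large-m approximation t_m ~ tau / m ?
ratio_3 = (TAU / 3) / schedule[2]     # m = 3
ratio_10 = (TAU / 10) / schedule[9]   # m = 10
```

With ten exposures every hologram ends at one tenth of the saturation amplitude, and the approximation $t_m \approx \tau/m$ is already within about 5% at the tenth exposure.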

The second architecture we will discuss is capable of implementing the
backwards error propagation algorithm [23,24] in a multilayered network.
The architecture, shown in Fig. 4, is a variation on a system we have described
previously [6]. The system as shown has two layers but an arbitrary
number of layers can be implemented as a straightforward extension. An
input training pattern is placed at plane $N_1$. The pattern is then interconnected
to the intermediate (hidden) layer $N_2$ via the volume hologram $H_1$.
A two dimensional spatial light modulator placed at $N_2$ performs a soft
thresholding operation on the light incident on it, simulating the action of
a 2-D array of neurons, and relays the light to the next stage. Hologram $H_2$
interconnects $N_2$ to the output plane $N_4$, where a spatial light modulator
performs the final thresholding and produces a 2-D pattern representing
the response of the network to the particular input pattern. This output
pattern is compared to the desired output and the appropriate error image
is generated (either optically or with the aid of an image detector and
re-recording) on the spatial light modulator $N_4$. The undiffracted beams
from $N_1$ and $N_2$ are recorded on spatial light modulators at $N_3$ and $N_5$,
respectively. The signals stored at $N_3$, $N_4$, and $N_5$ are then illuminated
from the right so that light propagates back towards the left. The back
propagation algorithm demands a change in the interconnection matrix
stored in $H_2$ given by

$\Delta w_{ij} = -\alpha \epsilon_i f'(x_i^{in}) x_j^{out}$  (28)

where $\alpha$ is a constant, $\epsilon_i$ is the error signal at the $i$th neuron in $N_4$, $x_i^{in}$ is the
input diffracted onto the $i$th neuron in $N_4$ from $N_2$, $f'(x)$ is the derivative of
the thresholding function $f(x)$ which operates on the input to each neuron
in the forward pass, and $x_j^{out}$ is the output of the $j$th neuron in $N_2$. Each
neuron in $N_4$ is illuminated from the right by the error signal $\epsilon_i$ and the
backward transmittance of each neuron is proportional to the derivative of
the forward output evaluated at the level of the forward propagating signal.
As we have described above, the hologram recorded in $H_2$ is the outer
product of the activity patterns incident from $N_4$ and $N_5$. Thus the change
made in the holographic interconnections stored in $H_2$ is proportional to
the change described by Eqn. (28).

The change in the interconnection matrix stored in $H_1$ required under
the back propagation algorithm is

$\Delta w_{jm} = -\alpha \sum_i \epsilon_i f'(x_i^{in}) w_{ij} f'(x_j^{in}) x_m^{out}$  (29)

where $x_m^{out}$ is the activity of the $m$th input on $N_1$. The error signal applied to $N_4$
produces a diffracted signal at the $j$th neuron in $N_2$ which is proportional
to

$-\sum_i \epsilon_i f'(x_i^{in}) w_{ij}$  (30)

We assume that during the correction cycle for $H_1$, $N_5$ is inactive. Once
again, if the backward transmittance of the $j$th neuron is proportional to
$f'(x_j^{in})$ then the change made to the hologram by the signals propagating
back from $N_2$ and $N_3$ is proportional to the change prescribed in Eqn. (29).
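
The two weight updates can be written out in software to make the flow of signals explicit. The following is a minimal sketch of back propagation for a two layer network, not a simulation of the optical system: it assumes a logistic thresholding function, defines the error as output minus target, and uses hypothetical names (`w1` for the matrix stored in the first hologram, `w2` for the second).

```python
import math


def f(x):
    """Soft threshold (logistic function, an assumption of this sketch)."""
    return 1.0 / (1.0 + math.exp(-x))


def f_prime(x):
    """Derivative of the soft threshold, evaluated at the forward input."""
    s = f(x)
    return s * (1.0 - s)


def forward(w1, w2, x):
    """Forward pass: input plane -> hidden plane -> output plane."""
    h_in = [sum(w * xm for w, xm in zip(row, x)) for row in w1]
    h_out = [f(v) for v in h_in]
    o_in = [sum(w * hj for w, hj in zip(row, h_out)) for row in w2]
    o_out = [f(v) for v in o_in]
    return h_in, h_out, o_in, o_out


def backprop_step(w1, w2, x, target, alpha):
    """One correction cycle; returns the squared error before the update."""
    h_in, h_out, o_in, o_out = forward(w1, w2, x)
    eps = [o - t for o, t in zip(o_out, target)]
    # Error times backward transmittance of each output neuron.
    d_out = [e * f_prime(v) for e, v in zip(eps, o_in)]
    # Signal sent back through the second matrix onto hidden neuron j.
    back = [sum(d_out[i] * w2[i][j] for i in range(len(w2)))
            for j in range(len(h_in))]
    d_hid = [b * f_prime(v) for b, v in zip(back, h_in)]
    # Outer product update of the second matrix (output error x hidden output).
    for i, row in enumerate(w2):
        for j in range(len(row)):
            row[j] -= alpha * d_out[i] * h_out[j]
    # Outer product update of the first matrix (hidden error x input).
    for j, row in enumerate(w1):
        for m in range(len(row)):
            row[m] -= alpha * d_hid[j] * x[m]
    return 0.5 * sum(e * e for e in eps)


w1 = [[0.1, -0.2, 0.3], [0.2, 0.1, -0.1]]
w2 = [[0.3, -0.3]]
errors = [backprop_step(w1, w2, [1.0, 0.0, 1.0], [1.0], alpha=0.5)
          for _ in range(50)]
```

Both updates are outer products of a backward propagating error pattern and a forward propagating activity pattern, which is the form that a holographic exposure records.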

A key element in this architecture is the assumption that the spatial
light modulators at $N_2$ and $N_4$ may have transmittances which may be
switched between a function $f(x)$ for the forward propagating signal and
$f'(x)$ for the back propagating signal. In both cases $x$ represents the forward
propagating signal. Two of us (Wagner and Psaltis) have previously
described how nonlinear etalon switches might be used in this application
[7,8]. Electrooptic spatial light modulators might also be used [8].

We  have  performed  an  experiment  to  show  how  a  single  layer  of  error 
driven learning might be implemented. This experiment is shown schematically
in Fig. 5. In this case, the stored vectors $x$ correspond to two
dimensional patterns recorded on a liquid crystal light valve from a video
monitor.  The  output  vectors  correspond  to  the  single  bit  output  of  the 
detector,  D.  The  input  vectors  are  imaged  onto  a  photorefractive  crystal 
via  two  separate  paths.  The  strength  of  the  grating  between  the  image  of 
the  input  along  one  path  and  the  image  along  the  other  path  is  read  out  by 
light  polarized  orthogonally  to  the  write  beams.  One  of  the  write  beams  is 
circularly  polarized  while  the  other  is  linearly  polarized.  The  polarizer,  P, 
blocks  the  out  of  plane  component  of  both  the  linearly  polarized  beam  and 
the  diffracted  circularly  polarized  beam,  passing  only  the  in  plane  diffracted 
beam.  This  allows  readout  of  the  grating  as  it  is  recorded.  The  diffracted 
light is imaged onto the detector, D. This system classifies input patterns
presented to it into two classes according to whether the output of
the detector when the pattern is presented is high or low. If, during training,
a pattern we would like to classify as high yields a low response, then the
hologram is reinforced by exposing the crystal to the interference of the two
beams, each carrying the image of that pattern. This exposure continues
until the diffracted output increases by a fixed amount. If a pattern which
should be classified as low is found during training to yield a diffracted output
that is too high, then the hologram diffracting that pattern is erased by
a fixed amount by exposing the crystal to only one of the imaging beams
(one beam is blocked by the shutter, SH). An experimental learning curve
showing the diffracted intensities for each learning cycle for four training
patterns in a system implemented using an Fe-doped LiNbO3 crystal is
shown in Fig. 6. The system classifies the patterns 0 and 2 as high and 1
and  3  as  low.  At  first  all  patterns  are  low.  The  first  two  learning  cycles 
are  intended  to  drive  the  outputs  of  0  and  2  above  threshold.  However, 
they  have  the  undesired  effect  of  also  driving  pattern  3  above  threshold. 
Thus, in the third learning cycle, pattern 3 is erased. In this particular erase cycle
the  erasure  was  too  severe.  Notice  that  pattern  2  is  erased  in  this  cycle, 
even  though  there  is  no  overlap  between  this  pattern  and  pattern  3.  The 
reason  for  this  is  that  the  two  images  of  pattern  3  are  in  focus  only  over  a 
limited  region  of  the  crystal  volume.  Outside  of  this  region  the  unfocused 
image  may  erase  the  hologram  formed  by  pattern  2.  In  the  subsequent  two 
cycles  patterns  0  and  2  are  again  reinforced.  This  has  the  unwanted  effect 
of  driving  both  patterns  1  and  3  just  above  threshold.  In  the  final  two 
cycles patterns 1 and 3 are erased until both are below threshold. At this
point  all  patterns  are  correctly  classified  and  learning  stops. 
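
The reinforce/erase procedure described above can be mimicked with a toy software model. The sketch below is a rough illustration under strong assumptions and is not a simulation of the experiment: the grating is a list of non-negative strengths, one per pixel; the detector output is the sum of the grating over the lit pixels of a binary pattern; a two-beam exposure raises that output by a fixed amount `step_up`; and a one-beam exposure removes a fixed fraction `erase_fraction` of the grating on the lit pixels. The defocus effect that erased pattern 2 in the experiment is not modeled. The four patterns, the threshold and all parameter values are made up for illustration.

```python
def detector_output(grating, pattern):
    """Diffracted signal for a binary pattern (assumed linear in the grating)."""
    return sum(g * p for g, p in zip(grating, pattern))


def train_hologram(patterns, targets, threshold=1.0, step_up=0.7,
                   erase_fraction=0.5, max_cycles=100):
    """Reinforce or erase until every pattern is on the right side of threshold.

    Returns (grating, cycles_used, converged).
    """
    grating = [0.0] * len(patterns[0])
    for cycle in range(max_cycles):
        corrections = 0
        for pattern, is_high in zip(patterns, targets):
            y = detector_output(grating, pattern)
            lit = sum(pattern)  # number of lit pixels in a binary pattern
            if is_high and y <= threshold:
                # Two-beam exposure: strengthen the grating where the
                # pattern is lit, raising its output by step_up.
                grating = [g + step_up * p / lit
                           for g, p in zip(grating, pattern)]
                corrections += 1
            elif not is_high and y >= threshold:
                # One-beam exposure: partially erase the grating where
                # the pattern is lit.
                grating = [g * (1.0 - erase_fraction * p)
                           for g, p in zip(grating, pattern)]
                corrections += 1
        if corrections == 0:
            return grating, cycle + 1, True
    return grating, max_cycles, False


# Hypothetical 10-pixel patterns: 0 and 2 should be high, 1 and 3 low.
# Pattern 1 overlaps pattern 0 on three pixels; pattern 3 overlaps pattern 2.
patterns = [
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # pattern 0
    [0, 1, 1, 1, 0, 0, 0, 0, 1, 0],  # pattern 1
    [0, 0, 0, 0, 1, 1, 1, 1, 0, 0],  # pattern 2
    [0, 0, 0, 0, 0, 1, 1, 1, 0, 1],  # pattern 3
]
targets = [True, False, True, False]

grating, cycles, converged = train_hologram(patterns, targets)
outputs = [detector_output(grating, p) for p in patterns]
print(converged, cycles, outputs)
```

With these made-up patterns the toy model shows the same qualitative behavior as the learning curve: reinforcing the high patterns pushes the overlapping low patterns above threshold, erasing them weakens the high patterns, and the two steps alternate until all four are classified correctly.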

In  this  experiment  the  photorefractive  crystal  acts  as  a  two  dimensional 
modulator.  The  diffraction  efficiency  between  the  two  imaging  paths  is 
high  where  the  patterns  0  and  2  overlap  and  low  where  patterns  3  and  1 
overlap. As mentioned above, a problem arises from the fact that the overlap
is  well  defined  only  in  the  image  plane,  meaning  the  crystal  must  be  thinner 
than  the  depth  of  focus  of  the  images.  In  order  to  utilize  the  full  capacity 
of  photorefractive  volume  holograms  it  will  be  necessary  to  move  beyond 
this  implementation  to  architectures  utilizing  the  full  three  dimensional 
capacity  of  the  crystal  as  discussed  above.  Nevertheless,  this  experiment 
demonstrates  in  a  rudimentary  way  how  learning  in  photorefractive  crystals 
may  proceed. 

4.5  Conclusion 

Photorefractive  crystals  represent  a  promising  interconnection  technology 
for  optical  neural  computers.  The  ease  of  dynamic  holographic  modification 
of  interconnections  in  these  crystals  allows  the  implementation  of  a  large 
class  of  outer  product  learning  networks.  The  density  of  interconnections 
which  may  be  implemented  in  these  crystals  is  limited  by  physical  and 
geometrical constraints to the range of $10^8$ to $10^{10}$ per cm$^3$. In order to
achieve  these  limits  consideration  must  be  given  to  the  exposure  schedule 
of  the  crystal. 

References 

[1] Y. S. Abu-Mostafa and D. Psaltis, Optical Neural Computers, Scientific
American, 256(3), 88 (1987).

[2] D. Psaltis, J. Yu, X. G. Gu, and H. Lee, Second Topical Meeting on
Optical Computing, Incline Village, Nevada, March 16-18, 1987.

[3]  D.  Psaltis,  X.  G.  Gu,  H.  Lee,  and  J.  Yu,  Optical  Interconnections 
Implemented  with  Volume  Holograms,  to  be  published. 


[4] P. J. Van Heerden, Theory of optical information storage in solids,
Appl. Opt., 2(4), 393 (1963).

[5] M. Cohen, Design of a new medium for volume holographic information
processing, Appl. Opt., 25(14), 2288 (1986).

[6] K. Wagner and D. Psaltis, Multilayer optical learning networks, Proc.
SPIE 752-16 (1987).

[7] K. Wagner and D. Psaltis, Nonlinear etalons in adaptive optical neural
computers, IEEE First Annual International Conference on Neural
Networks, San Diego, California, June 21-24, 1987.

[8] K. Wagner and D. Psaltis, Multilayer optical learning networks, to
appear in Appl. Opt.

[9]  D.  Z.  Anderson,  Adaptable  interconnects  for  optical  neuromorphs: 
demonstration  of  a  photorefractive  projection  operator,  Proceedings 
of  the  International  Conference  on  Neural  Networks,  San  Diego,  June 
1987. 

[10] T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag,
Berlin, (1984).

[11] J. J. Hopfield, Neural networks and physical systems with emergent
collective computational abilities, Proc. Natl. Acad. Sci. USA,
79, 2554 (1982).

[12] S. S. Venkatesh and D. Psaltis, Information storage and retrieval in
two associative nets, Conf. on neural network models for computing,
Santa Barbara, California, April 1985.

[13] L. Personnaz, I. Guyon, and G. Dreyfus, Information storage and
retrieval in spin-glass like neural networks, J. Physique Lett., 46,
L359 (1985).

[14] D. Psaltis and C. Park, Nonlinear discriminant functions and associative
memories, Neural Networks for Computing, APS conference proc.
No. 151 (1986).

[15] T. Maxwell, C. L. Giles, Y. C. Lee and H. H. Chen, Nonlinear dynamics
of artificial neural systems, Neural Networks for Computing, APS
conference proc. No. 151 (1986).

[16] E. B. Baum, On the capabilities of multilayer perceptrons, to be
published.

[17] D. Psaltis and N. H. Farhat, Optical information processing based on
an associative memory model of neural nets with thresholding and
feedback, Opt. Lett., 10(2), 98 (1985).

[18] Y. Owechko, G. J. Dunning, E. Marom, and B. H. Soffer, Holographic
associative memory with nonlinearities in the correlation domain, Appl.
Opt., 26(10), 1900 (1987).

[19] B. Kosko and C. Guest, Optical bidirectional associative memories,
SPIE Proc. 758, Jan. 1987.

[20] R. A. Athale, H. H. Szu, and C. B. Friedlander, Optical implementation
of associative memory with controlled nonlinearity in the correlation
domain, Opt. Lett., 11(7), 482 (1986).

[21] F. Rosenblatt, Principles of Neurodynamics: Perceptron and the Theory
of Brain Mechanisms, Spartan Books, Washington, (1961).

[22] B. Widrow and M. E. Hoff, Adaptive switching circuits, IRE Wescon
Conv. Rec., pt. 4, 96 (1960).

[23] D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed
Processing, Volume 1, MIT Press, Cambridge, (1986).

[24] D. B. Parker, Learning logic, Invention report, S81-64, File 1, Office of
Technology Licensing, Stanford University, October 1982.

[25] J. S. Denker (Ed.), Neural Networks for Computing, APS conference
proc. No. 151 (1986).

[26] A. D. Fisher, R. C. Fukuda, and J. N. Lee, Implementations of adaptive
associative optical computing elements, Proc. SPIE 625, 196 (1986).


[27] K. Fukushima, A hierarchical neural network model for associative
memory, Biol. Cybern., 50, 105 (1984).

[28] S. Grossberg, Studies of Mind and Brain, D. Reidel Pub. Co., Boston,
(1982).

[29] N. V. Kukhtarev, V. B. Markov, S. G. Odulov, M. S. Soskin, and V. L.
Vinetskii, Ferroelectrics, 22, 949 (1979).

[30] J. W. Goodman, Fan-in and fan-out with optical interconnections,
Optica Acta, 32(12), 1489 (1985).

[31] D. L. Staebler and W. Phillips, Fe-doped LiNbO3 for read-write
applications, Appl. Opt., 13(4), 788 (1974).


Figure 1. General neural network architecture.

Figure 2. Two layer network with lateral inhibition. Connections ending with an open
circle are inhibitory.

Figure 5. Simple photorefractive learning system. PB is a polarizing beamsplitter. L1
and L2 are imaging lenses. WP is a quarter waveplate. SH is a shutter. P is a polarizer.
D is a detector. M is a mirror.

Figure 6. Experimental learning curves.


